

Credit Card Fraud Detection using Deep Learning Techniques

Oona VOICAN
Ministry of Transport and Infrastructure, Bucharest, Romania
oona.voican@yahoo.com

The objective of this paper is to identify credit card fraud, a problem that can be addressed with the help of advanced machine learning and deep learning techniques. Because credit card fraud is a serious worldwide problem, we have chosen to create a model for detecting imposter scams by using deep neural networks. The purpose is to understand, determine and learn the normal behavior of the user, and more precisely to detect identity fraud. Each person has a trading pattern: he or she uses certain operating systems, completes transactions at specific times, and usually spends large amounts of money within a certain time range. Transactions made by a given user therefore follow a pattern, which can be identified with the help of neural networks. Machine learning involves teaching computers to recognize patterns in data in the same way our brains do. Deep learning is a subfield of machine learning that deals with algorithms inspired by the structure and function of the brain; at its core is the ability to form higher and higher levels of abstraction over representations of raw patterns in data. The data used to train the model is real, and it is processed using the one-hot encoding method so that categorical variables can be used by the machine learning algorithm.
Keywords: Artificial Intelligence, Credit Card Fraud, Deep Learning, Neural Network Model,
User Behavior
DOI: 10.24818/issn14531305/25.1.2021.06

1 Introduction
Machine learning has its foundations between the 1950s and 1960s, when researchers built a series of perceptrons, also called "artificial brains" at that time. In the late 1950s, Frank Rosenblatt suggested a very simplified mathematical model, called the perceptron, which mimics the recognition process of the human brain [1]. Perceptrons were used to explore the poorly understood problem of how the human brain manages to learn; they were trained and instructed to detect differences between the faces of women and men. A human being can easily place a person in one of the two categories. However, at that time, a computer needed a series of complex rules in order to be able to make this distinction. The perceptron was trained by being given many examples of portraits, including haircuts that are more unusual. After being trained, it was given a new dataset, with people it had never seen before, and it was able to classify the subjects correctly. However, that approach was later abandoned.

Fig. 1. The Perceptron



From 1970 through the late 1980s, there was a period called the "AI Winter", signifying low interest in research in the field of artificial intelligence. In 1965, the first multi-layer neural network architecture had been created, but with a small number of neurons; the activation functions used were polynomial and the training methods were statistical. In 1980, Fukushima introduced the first convolutional neural network, inspired by the way the visual cortex works. Starting in 1989, the famous backpropagation algorithm was applied by Yann LeCun, who used it for handwritten zip code recognition [2]. He was also the one who had the idea of reducing dimensionality through the concept of the autoencoder. Since the 1990s, neural networks have become popular and dominant, mainly due to the discovery of the backpropagation algorithm for multi-layer neural networks. In 1997, computer scientists Sepp Hochreiter and Jürgen Schmidhuber created the gradient-based method called "Long short-term memory", which was used for language recognition [3]. Around the 2000s, kernel methods gained interest, largely due to the instability of neural networks. Around 2010, neural networks began to become popular again, due to the deep learning concept. In 2011, Google invested in artificial intelligence for a voice command project for smartphones that used the Android operating system. In 2016, there was an increase in accuracy of approximately 60% for translations performed by Google.
Nowadays, many companies use deep learning, which finds applications anywhere from self-driving cars to fake news detection to even predicting earthquakes. Facebook, for example, uses these techniques for facial recognition, natural language processing, managing advertisements, and providing recommendations. Netflix also uses deep learning algorithms to suggest certain movies and series to users, based on the profiles of other people with the same preferences.

1.1 Classification of machine learning
Machine learning is a discipline in which we define a program not by writing it entirely ourselves, but by letting it learn from data. Researchers believe that machine learning is the best way to make progress in the direction of artificial intelligence. In this context, machine learning refers to the identification of patterns and structures and to making decisions based on observations extracted from the received data [4].
Traditional programming requires a computer on which a program runs over a dataset, so that the process provides valid output data, as can be seen in Figure 2.

Fig. 2. Representation of a traditional program: Inputs → Program → Results (adaptation after [5])

Machine learning requires a set of input data, and sometimes output data, that are interpreted by a computer, thus resulting in a program. Machine learning explores algorithms that can learn from datasets, so that computers learn without being explicitly programmed to do so. The simplified representation of machine learning can be seen in Figure 3.

Fig. 3. Machine learning model: Inputs → Model → Results (adaptation after [5])



In 1959, Arthur Lee Samuel, an American pioneer in the field of machine learning, defined it as the "field of study that gives computers the ability to learn without being explicitly programmed". Kevin Murphy, author of the famous work "Machine Learning: A Probabilistic Perspective", defined machine learning as a series of algorithms that automatically detect structures and patterns in datasets [6].

2 Literature Review
There are several methods that have proven effective in detecting fraudulent transactions, such as decision trees, genetic algorithms, clustering techniques and neural networks.
The idea of the similarity tree was developed from the idea used in the case of decision trees. Similarity trees are recursively defined, starting from labelling nodes with attribute names, labelling edges with attribute values that correspond to a required condition, and leaf nodes that represent the ratio between the number of transactions that comply with the conditions and the total number of legitimate transactions [7]. This graphically illustrated method is very easy to understand and to put into practice. The less pleasant part could be the need to check each feature of each transaction in order for it to be categorized. On the other hand, it turned out that similarity trees had very good results in other types of fraud detection problems.
In 2000, an algorithm recommended by P. Bentley for establishing logic-based rules that divide suspicious and unsuspicious transactions into different categories was based on genetic algorithms. The experiment described in the study used a database of 4,000 credit card transactions with 62 fields. As with classification and similarity trees, the dataset was divided into a training set and a test set. All types of rules that could emerge from the available fields were tested, and the best was the rule with the highest predictability. The method proved to be highly efficient in the field of real estate insurance and was considered a possibly effective way to detect credit card fraud [8].
A year earlier, in 1999, a specialized algorithmic model was developed to predict suspicious behavior. The original part of that research is that the model is evaluated based on a cost model, while in other studies the evaluation is based on prediction rates and error rates [9].
In 2000, Wheeler and Aitken developed the idea of combining several algorithms in order to maximize predictive power. They wrote an article explaining diagnostic algorithms, probabilistic curve algorithms, as well as best fit, density selection, and negative selection algorithms. Following their extensive investigation, an adaptive diagnosis algorithm combining several neighbourhood-based and probabilistic algorithms was found to have the best performance, and these results indicate that an adaptive solution can provide fraud filtering and case ordering functions for reducing the number of final-line fraud investigations necessary. These algorithms can be further improved by using additional diagnostic algorithms in order to make decisions in exceptional cases or to calculate the level of confidence [10].
Regarding the approach to fraud detection using clustering techniques, Bolton and Hand suggested in 2002 two techniques to solve the problem of behavioural fraud [11]. Group analysis helps to identify accounts that behave differently at a certain time compared to other transactions recorded on the same card at different times in the past; the different behavior thus leads to the detection of fraud. Suspicious transactions cause the accounts from which they were debited to be marked, and following this filtering, those cases are investigated in more detail. It is assumed that if the account holder attached to the card has a certain behavior and a pattern of making payments, it is unlikely that he or she will behave completely differently only once.
The approach of breakpoint analysis is different from the previous method. Instead of tracking the account attached to the card, each transaction is tracked for each credit card separately. An example of such a suspicious incident would
be the sudden trading of a large sum of money or a high frequency of use.
Another interesting approach to the problem of fraudulent transactions is given by neural networks. In 1997, J. Dorronsoro and other collaborators developed an accessible system based on a neural classifier; the data, however, had to be grouped according to the type of account [12]. Similar concepts were also found in the studies made by E. Aleskerov and B. Freisleben, in the paper called "CardWatch", related to a database mining system based on neural networks. In 1995, K. Leonard described a rule-based expert system model for fraud alerts in consumer credit [13]. In 1996, researchers Ezawa and Norton applied a technique for detecting fraud in the telecommunications industry, using Bayesian neural networks [14]. Four years later, in 2000, Maes and a small team applied techniques based on Bayesian neural networks to detect bank card fraud [15]. However, regardless of how frauds are detected, the characteristics of the transactions are the same: the number of real transactions will always be the majority, and the model either has to deal with this imbalanced distribution or the data needs to be rebalanced.

3 Methodology
A major problem we are facing this century, as far as both companies and individuals are concerned, is card fraud, especially credit card fraud. Two types of fraud may occur: through the physical presence of the card (approximately 27% of all cases) and through the physical absence of the card (approximately 73% of cases). The first category includes fraud at ATMs and POS terminals using cloned cards, which contain stolen data from victims' cards, or stolen and lost cards. Representatives of the European Central Bank stated in an official report in 2018 that fraud without physical possession of the card has increased significantly, in direct proportion to the development of online commerce; fraudulent payments online, by phone or even by mail represent three quarters of the total.
With over 270,000 reports, credit card fraud was the most common type of identity theft last year, and reports more than doubled from 2017 to 2019. Almost 165 million records containing personal data were exposed through data breaches in 2019. Of the more than 3.2 million fraud cases reported to the Federal Trade Commission (FTC) in 2019, identity theft accounted for 20.33% of cases and was the most common type of fraud [16].

[Bar chart of yearly report counts: 2014: 55,553; 2015: 74,902; 2016: 124,514; 2017: 133,096; 2018: 157,715; 2019: 271,823]
Fig. 4. Credit card fraud reports in the US by year (author processing based on [16])

Credit card fraud has been steadily increasing over the years, but it exploded in 2019, with the number of reports increasing by 72.4% from 2018.

[Bar chart of 2019 identity theft reports by type: Credit card fraud: 271,823; Other identity theft: 215,682; Loan or lease fraud: 104,699; Phone or utilities fraud: 83,535; Bank fraud: 58,723; Employment or tax-related fraud: 45,564; Government documents or benefits fraud: 23,052]
Fig. 5. Graph of identity fraud reports in 2019 (author processing based on [16])

As we can see in the figure above, in 2019 credit card fraud was the most common form of identity theft, accounting for about 33.84% of all identity theft reports. It is well known that thieves can target and obtain the personal information of the masses through fraud, misuse of personal information or outright theft. They can use personal information such as birthday, address or PIN to withdraw money from existing bank accounts, and they can also open new accounts or even obtain loans on behalf of victims.
It has been shown that imposter scams come through a variety of different channels. According to the Consumer Sentinel Network, in 2019 most consumers affected by fraud were contacted by imposter scammers by phone, and estimated losses due to this type of fraud are about $493 million.

[Bar chart of total $ lost by contact method, with values of $493,000,000 (phone), $325,000,000, $226,000,000, $136,000,000, $87,000,000 and $51,000,000]
Fig. 6. Amount lost by contact method (author processing based on [16])



4 Implementing the Solution
Why deep learning, and why now? To answer that question, we first have to go back and understand traditional machine learning at its core. Traditional machine learning algorithms typically try to define a set of rules or features in the data; these are usually hand-engineered, and because they are hand-engineered they often tend to be brittle in practice. The key idea of deep learning is to learn the features directly from raw data: we can take a set of images of faces, and the deep learning algorithm will develop a hierarchical representation, first detecting lines and edges in the image, using these lines and edges to detect corners, then mid-level features like eyes, noses, mouths and ears, then composing these together to detect higher-level features such as jaw lines and the side of the face, which can finally be used to detect the overall face structure.
Why now? The fundamental building blocks of deep learning have existed for decades, and the underlying algorithms for training these models have also existed for many years, so why are we studying this now? For one, data has become much more pervasive: we are living in the age of big data, and these algorithms are hungry for huge amounts of data to succeed. Secondly, these algorithms can benefit tremendously from modern GPU architectures and hardware acceleration that simply did not exist when they were developed. Finally, due to open-source toolboxes like TensorFlow and PyTorch, building and deploying these models has become extremely streamlined.
The Python language appeared, according to the official documentation, in the early 1990s, and is a multi-platform language that can run on Linux/Unix, Windows, Mac OS, and Raspberry Pi. The language was created by Guido van Rossum in the Netherlands, as a successor to the ABC language. Even though Guido van Rossum is the main author, many other people have contributed to its development [17]. Python 3.9.1 is the newest release of the Python programming language and was released on December 7, 2020; all released versions are open-source. Python has a very powerful standard library, as well as many other libraries used in data analysis, such as Pylearn2, Pybrain, MontePython, Pattern, Hebel and MILK. Other more common libraries are NumPy, Pandas, PyMongo, Matplotlib and Scikit-learn. Python code can be run in several development environments (IDEs), such as NetBeans, Eclipse, Thonny and PyCharm, and the language also provides support for connecting to databases [17].
Tabular modelling takes data in the form of a table (like a spreadsheet or CSV file). The objective is to predict the value in one column based on the values in the other columns. The data used to develop the solution are in the csv file "date_frauda.csv", a table of 151,114 rows and 11 columns representing real recorded transactions.
The paper aims to understand and identify patterns in normal user behavior, and for this it is vital to use meaningful and relevant features. The significant characteristics that we will use to create the model are the user id, the purchase value, the device id used, the source through which the transaction was made (through advertisements, by direct access, or through SEO, Search Engine Optimization), the browser used, gender, age and IP address. There is also the class feature, which indicates whether the transaction is fraudulent. The structure of the data used can be seen in Figure 7.

Fig. 7. Representation of the data
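As a rough illustration, loading and inspecting such a table with pandas could look like the sketch below; the file name comes from the text, while everything printed is just a sanity check of the shape and label distribution described above.

```python
import pandas as pd

# Load the transaction table described above (151,114 rows, 11 columns).
# The file name is taken from the paper; "class" is the fraud label.
df = pd.read_csv("date_frauda.csv")

print(df.shape)                    # expected: (151114, 11)
print(df.head())                   # first rows of the raw data (Figure 7)
print(df["class"].value_counts())  # distribution of fraud vs. legitimate
```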



Due to the fact that each person has a certain pattern of making payments, transactions made by fraudulent users will have a different pattern from the ones the model learned during training. Using the significant features presented above, I want to process the data so that it can be used to create the neural network. Although there are many supervised learning methods (decision trees, random forests) that have no problem using categorical variables, neural networks need numerical variables. The optimal method to transform categorical variables into numeric ones is the one-hot encoding technique from the sklearn.preprocessing library [18]. In the official documentation provided by sklearn, the explanation is that categorical integers are encoded using a one-of-K scheme. In other words, this is a process by which categorical variables are converted into a form that can be used by machine learning algorithms. To illustrate in a simplified way how the encoding works, we will assume that we want to transform only the "Browser used" column from categorical values. Table 1 shows the initial form to be processed.

Table 1. Table illustrating categorical variables

Device id     Browser
22058         Chrome
333320        Opera
1359          Safari
150084        Firefox
360585        Chrome
79203         Chrome
31383         Safari
159842        Microsoft Edge

After processing with one-hot encoding, the table below, using binary values, resulted. A value of 1 indicates that the value is present, and 0 that it is absent. In addition, a new column has appeared for each distinct value existing in the browser column of the old table.

Table 2. Table illustrating the transformation of categorical variables using one-hot encoding

Device id   Chrome   Opera   Safari   Firefox   Microsoft Edge
22058       1        0       0        0         0
333320      0        1       0        0         0
1359        0        0       1        0         0
150084      0        0       0        1         0
360585      1        0       0        0         0
79203       1        0       0        0         0
31383       0        0       1        0         0
159842      0        0       0        0         1
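A minimal sketch of this transformation is shown below. The paper uses the one-hot encoder from sklearn.preprocessing, which produces the same binary matrix; pandas' get_dummies is shown here purely as a compact equivalent.

```python
import pandas as pd

# The categorical column from Table 1.
df = pd.DataFrame({
    "device_id": [22058, 333320, 1359, 150084, 360585, 79203, 31383, 159842],
    "browser": ["Chrome", "Opera", "Safari", "Firefox",
                "Chrome", "Chrome", "Safari", "Microsoft Edge"],
})

# One new binary column per distinct browser value; a 1 marks the
# browser actually used in that row (Table 2), a 0 marks absence.
encoded = pd.get_dummies(df, columns=["browser"], dtype=int)
print(encoded)
```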

We can observe that, as the number of distinct values of the categorical variables increases, the complexity of the new object also increases, since it will have a significantly higher number of columns.
Returning to the data processed in the project, we first divide the data into two objects of type "pandas.core.frame.DataFrame": X contains all the columns in the csv file except the "class" column, and Y contains only the "class" column.
Next, we process the X dataframe. For the columns whose values were numeric, we applied normalization of the data distributions using StandardScaler from the "sklearn.preprocessing" package. The idea behind the StandardScaler technique is to transform the analysed data so that the distribution has mean 0 and standard deviation 1. In our case, this is done per feature, thus independently for each column. This step is very important because we want to start from the premise that we work with a normal distribution. I want to normalize the numerical data because I want the values of each column to have the same importance in the analysis: in the studied dataset, we cannot leave the "user_id" column with values of tens or hundreds of thousands while the "purchase_value" column has values of tens or hundreds of units. The values modified as a result of the normalization process can be viewed in Figure 8.

Fig. 8. Representation of normalized numerical data
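A sketch of the X/Y split and the normalization step described above; the numeric column names "user_id", "purchase_value" and "age" are assumptions used for illustration.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

df = pd.read_csv("date_frauda.csv")
X = df.drop(columns=["class"])  # inputs: every column except the label
Y = df["class"]                 # label: fraudulent or not

# Standardize each numeric column independently to mean 0 and
# standard deviation 1: z = (x - mean) / std.
numeric_cols = ["user_id", "purchase_value", "age"]  # assumed column names
X[numeric_cols] = StandardScaler().fit_transform(X[numeric_cols])
```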

Then, as in the example presented in the tables above, one-hot encoding was applied in the project to the columns "device id", "source", "SEO", "browser used", "sex" and "IP address". An alternative would have been label encoding, but this is not an effective method here, as it assigns numbers from 0 up to the number of distinct elements. In this way, during the training phase, the model could consider, in the process of identifying rules, that an index Xi is more important than an index Xi-1. Therefore, to avoid creating such rules, we chose to use one-hot encoding.
Finally, after these steps, instead of the 11 columns we had initially, 137,970 columns appeared. In the figure below we can see the first 5 rows. The data is kept as a DataFrame object named X_final.

Fig. 9. Representation of the transformation of categorical variables into binary numbers
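Continuing the earlier sketch (reusing X and Y), the expansion to X_final and the split described in the next section could look as follows; the categorical column names are again assumptions, and pd.get_dummies stands in for the sklearn encoder.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# One-hot encode the categorical columns; each distinct value becomes
# its own binary column, which is how 11 columns grow to 137,970.
categorical_cols = ["device_id", "source", "browser", "sex", "ip_address"]
X_final = pd.get_dummies(X, columns=categorical_cols)

# 70% / 30% split, as initially used in the paper (the final experiments
# then kept only 3,000 rows for training and 2,000 for testing).
X_train, X_test, Y_train, Y_test = train_test_split(X_final, Y, test_size=0.3)

# Keras expects numpy arrays rather than DataFrames.
X_train, X_test = np.asarray(X_train), np.asarray(X_test)
Y_train, Y_test = np.asarray(Y_train), np.asarray(Y_test)
```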



The next step is to divide the datasets X and Y. X represents the input data, and Y the labelled output data, which determines whether the transaction is fraudulent or not. The data is divided into X_train, Y_train, X_test and Y_test. This step is done using "train_test_split" from the "sklearn.model_selection" library. As a rule, about three quarters of the data is used to train the model and a quarter to test it. Initially, we divided the data into 70% for training and 30% for testing. However, because there are 151,112 rows and 137,970 columns, we did not have a computer powerful enough to perform the training on the entire dataset, so out of the total of 151,112 transactions we used 3,000 for training and 2,000 for testing. The next step involves transforming the X_train, Y_train, X_test and Y_test dataframes into "numpy" arrays.

5 Creating the Neural Network
For this stage, we will use deep learning, which is a modern area within the more general discipline of machine learning. Deep learning methods automatically extract the significant features from the large volume of data in order to make connections.
A deep learning neural network is essentially an artificial neural network with several hidden layers: at least three layers in total, including the input and output ones. Each layer can have its own number of neurons. Deep learning neural networks, whether capturing a linear or a nonlinear relationship, use mathematical functions to obtain output data by processing input data. The model created within the project will produce probabilities for all output data after passing through all the layers. Each hidden layer in the neural network extracts implicit information from the features; the hidden layers capture more and more complexity with every layer by discovering relationships between the features in the input. Therefore, complex data can be modelled with fewer units [19].
The network will be trained by giving it input and output data. The information considered most relevant for solving the classification problem will be extracted. These important features will be denoted x1, x2, ..., x137970: x1 is the user id, x2 is the transaction value, x3 is the age, x4 is the IP address, and so on. This data will form the input dataset of the perceptron, which is the basic unit of any neural network. The inputs will be multiplied by weights, denoted w1, w2, ..., w137970. In the official documentation, the bias is denoted by b; the bias is a constant, an input value of each neuron. The output of the neural network will be binary: 1 or 0.
The idea behind training a neural network is to discover the optimal values w1, w2, ..., w137970 and b. These values must be optimal for the entire dataset and not just for one transaction. After processing the data, the neural network model is created. For this part, we need the Keras library. Keras is an open-source library specialized in neural networks, written in the Python language. It can run on top of TensorFlow, R, Microsoft Cognitive Toolkit, PlaidML or Theano. Keras contains countless blocks of code used in neural networks, such as activation functions, optimizers and layers, to simplify the process of creating a network. The code is kept on GitHub and is well documented, and the library also provides support for recurrent and convolutional neural networks. Another useful aspect is the possibility for users to deploy models on mobile devices with Android and iOS operating systems, or even on the web via the JVM. For the model part we need Sequential, and for the network layers we need Dense and the Dropout technique.
In Keras one can create a sequential model; Keras defines the sequential model as a linear stack of layers. When calling the Sequential constructor, we passed a list of two Dense objects, a Dropout object, and three more Dense objects. Dense objects are the most basic type of layer and connect each value of the input with each value of the output of the layer. The Dense constructor has an input_dim property, which corresponds to the number of columns of the input data. It is necessary to define it only for the first layer, because Keras can infer the shapes of the following layers without the need to
specify them explicitly. Thus, for the first layer we assigned the value 137970 to the input_dim property, meaning the number of columns in the shape of the X_train object. Therefore, the first layer will have 137,970 inputs, while the last layer, the output layer, will have a single node.
We use dense layers because a simple linear regression is not enough, so we concatenate different perceptrons into more complex networks. As can be seen in the image below, the input data are x1, x2, ..., and the weights we want to train are w11, w12, w21, w22, etc.

Fig. 10. Complex neural network (using dense layers)

Then, the perceptrons are grouped together at the same level, connecting the output of one layer with the input of the next layer. In this way, a complete architecture can be created. To get more accurate results, we chose to use 5 layers, and after the first 2 layers dropout was applied. Dropout is a regularization technique that was introduced in 2012 by Geoffrey E. Hinton et al. [20]. The basic idea is to randomly set some activations to zero at training time; this makes sure all neurons actively work toward the output. Simply put, dropout refers to ignoring a randomly chosen set of units (i.e. neurons) during the training phase. The neural network model will take the form below:

Fig. 11. Neural network model
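A sketch of how this stack could be declared in Keras. The input dimension (137970), the ReLU activations on the first four layers, the single sigmoid output, the layer list (two Dense, one Dropout, three more Dense) and the 0.5 dropout rate all come from the text; the hidden-layer widths are assumptions, since the paper does not state them.

```python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout

# Two Dense layers, a Dropout layer, then three more Dense layers,
# matching the list described above. Hidden widths (128, 64, ...) are
# illustrative assumptions only.
model = Sequential([
    Dense(128, activation="relu", input_dim=137970),  # one input per column
    Dense(64, activation="relu"),
    Dropout(0.5),                    # randomly zero activations in training
    Dense(32, activation="relu"),
    Dense(16, activation="relu"),
    Dense(1, activation="sigmoid"),  # single node: probability of fraud
])
model.summary()
```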

Activation functions are mathematical equations that determine the output of a neural network. The function is attached to each neuron in the network and determines whether the neuron should be activated ("fired") or not, based on whether that neuron's input is relevant for the model's prediction. For the first four layers, I chose to use the ReLU activation function. The name is an acronym for "rectified linear unit"; it is the most widely used activation function, f(x) = max(0, x), giving an output of x if x is positive and 0 otherwise.

Fig. 12. Graphical representation of the ReLU function

For the last layer, I use the Sigmoid activation function, because this function is used for models where we have to predict a probability as output (the transaction may or may not be fraudulent).

Fig. 13. Graphical representation of the Sigmoid function
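For reference, the logistic sigmoid plotted above can be written as:

```latex
\sigma(x) = \frac{1}{1 + e^{-x}}, \qquad \sigma(x) \in (0, 1)
```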

The Sigmoid function is similar to the step function, whose output for each perceptron is exactly 0 or 1, but it can take any value in the interval (0, 1). There is a wide family of sigmoid functions, which also includes the hyperbolic tangent and logistic functions; in general, we can consider any function with an "S" shape to be sigmoid. The Sigmoid function is concave for values greater than 0 and convex for values less than 0.
Feed-forward is the process of calculating output data from input data in a neural network, and it is also the first step in the process called backpropagation; this step also makes the prediction for the input dataset. We define a set of inputs to a neuron as x1, x2, ..., xm, where each input has a corresponding weight w1, w2, ..., wm. We multiply each input by its weight and take the sum of all of them, z = w1x1 + w2x2 + ... + wmxm + b; then we pass this single number through a nonlinear activation function, which produces the final output y. The term b is called the bias of the neuron, and its purpose is to allow shifting the activation function to the left or to the right regardless of the inputs.
The resulting output value will be compared with the value of the labels, which are always present in supervised learning. Having the expected output and the calculated output, we can check whether the prediction was correct. If the results do not match, we propagate the error back across the neural network; in this way, we can figure out why the error occurred and correct it, identifying the weights that led to the error so they can be rectified. The next feed-forward pass will then have a better result than the previous one. The process is repeated several times until the results are satisfactory. To do this, we use epochs: an epoch is one complete pass through the dataset, and within the project the process was iterated for 5 epochs.
The problems that can be encountered are overfitting and underfitting. Overfitting means that the model fits the training data very well when we provided a dataset containing many features.
The model tends to memorize the data rather than learn from it, which makes it unable to predict on new data. A statistical model or a machine learning algorithm is said to underfit when it cannot capture the underlying trend of the data. Underfitting destroys the accuracy of our machine learning model; its occurrence simply means that the model or the algorithm does not fit the data well enough.

Fig. 14. Graphical representation of underfitting and overfitting [21]

To avoid overfitting, we decided to use the Dropout regularization method. The main idea is to randomly remove neural components (outputs) from a layer of the neural network; it is similar to the concept of bagging. The purpose of its use in this paper is to prevent overfitting and co-adaptation. Co-adaptation means that some neurons are highly dependent on others: if the neurons they depend on receive "bad" inputs, the dependent neurons can be affected as well, and ultimately this can significantly alter the model performance, which is what might happen with overfitting. Therefore, the simplest solution to overfitting is dropout, which works as follows: each neuron in the network is randomly omitted with some probability between 0 and 1, and here a hidden-neuron probability of 0.5 is used. When neurons are eliminated, the reduced inputs are compensated by increased weights. After training, the weights are scaled for a good network prediction: they are multiplied by the dropout rate. Dropout is used because many studies have shown that it leads to significantly better losses compared to many other regularization types.

6 Model Training and Testing
In 2017, machine learning specialist Jason Brownlee published an article on the Adam optimization algorithm. The differences between optimizers show up in the training time of the model: good results can be obtained in a few minutes, a few hours or a few days, depending on the choice of optimizer. The Adam optimizer is an extension of stochastic gradient descent (SGD). The term gradient descent literally means descending a slope to reach the lowest point of a surface; the goal is to find the value of x for which f(x) is minimal. Jason Brownlee described stochastic gradient descent as an iterative algorithm that starts from a random point of a function and descends until it finds the minimum value allowed by that function [22]. The learning rate remains constant for each weight update.
Returning to the Adam optimizer, it can be used to replace the classic stochastic gradient descent procedure to iteratively improve the weights of the network during training. Deep learning specialist Diederik P. Kingma said at a conference in Toronto, in 2015, that the name Adam is not actually an acronym, but is "derived from adaptive moment estimation" [23].
The advantages of using the Adam optimizer are computational efficiency, ease of implementation, low memory requirements, invariance to diagonal rescaling of the gradients, and efficiency on large volumes of data. Compared to classical stochastic gradient descent, this optimizer combines the advantages of two other gradient descent extensions, namely AdaGrad and RMSProp.

Fig. 15. Neural networks using dropout stochastic regularization [23]
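As a toy illustration of the plain gradient-descent rule that Adam extends, the following sketch minimizes a simple quadratic; the function, starting point and learning rate are illustrative only.

```python
# One-dimensional gradient descent on f(x) = (x - 3)^2.
# Adam extends this constant-learning-rate rule with per-parameter
# adaptive estimates of the first and second moments of the gradient.
def grad(x):
    return 2 * (x - 3)        # derivative of (x - 3)^2

x = 10.0                      # arbitrary starting point
lr = 0.1                      # constant learning rate, as noted in the text
for _ in range(100):
    x -= lr * grad(x)         # w <- w - lr * dL/dw
print(round(x, 4))            # converges toward the minimum at x = 3
```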

The Adam optimizer and the binary cross entropy loss function were used during model training. The binary cross entropy loss function is used because the result of the model is binary (the transaction may or may not be fraudulent).
The 3,000 rows used in the training set were covered in 5 epochs, with a batch size of 15. The batch size refers to the number of samples that are sent to the neural network at a given time. Thus, the 3,000 rows were divided into batches of 15 transactions each; given that an epoch is one complete pass through the dataset, it takes 3000/15 = 200 batches to cover all the data of an epoch.
In general, the larger the batch size, the faster the model will complete an epoch during training. This is because, depending on the computational resources of the computer, it will be able to process multiple samples simultaneously. However, even if the computer processes faster, a larger batch size could alter the quality of the model. Thus, depending on the model result, the batch size can be adjusted. Then, calling model.fit, the training will start according to the given specifications.

Fig. 16. Model training
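The training call sketched below reuses the model and arrays from the earlier sketches and matches the settings given in the text: the Adam optimizer, the binary cross-entropy loss, 5 epochs and a batch size of 15.

```python
# Adam + binary cross-entropy, since the target is a 0/1 label.
model.compile(optimizer="adam",
              loss="binary_crossentropy",
              metrics=["accuracy"])

# 3,000 training rows with batch_size=15 -> 3000 / 15 = 200 weight
# updates per epoch, repeated for 5 epochs.
model.fit(X_train, Y_train, epochs=5, batch_size=15)
```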

Any optimization problem must have a goal. Here, the goal is to minimize the value of the loss function: the optimizer will set the weights so that the loss function has a value as close as possible to 0. The loss refers to the error obtained by comparing the predicted value with the expected one; if the values do not match, we have an error, so it is natural to want the loss to be as small as possible. If the model fails to identify a transaction label correctly, at the next iteration it will change the weights so that the same data from the past epoch comes closer to a better result. In this way, the model will learn.
After training the model, a high accuracy can be seen in the figure below. From the first epoch, where the accuracy was 89%, to the last epoch, it increased to 95%. Accuracy has increased significantly while loss has decreased significantly: in the first epoch the error was very large, namely loss = 43%, while in the last epoch the error decreased to loss = 9%.

Fig. 17. Training results

7 The Interpretation of the Results
As can be seen in the confusion matrices for the training and testing models, very good results were recorded.
In the training model, out of 3000 transactions, of which 306 were fraudulent and 2694 were real, the model correctly identified all 306 fraudulent ones and 2687 of the real ones. This means that it considered 7 non-fraudulent transactions as fraudulent, so the classification was correct for (3000-7)/3000 = 99.76% of transactions.
In the tested model, out of 2000 transactions, of which 217 were fraudulent and 1783 were real, the model correctly identified all 217 fraudulent ones and 1778 of the real ones. This means that it considered 5 non-fraudulent transactions as fraudulent, so the classification was correct for (2000-5)/2000 = 99.75% of transactions.

Fig. 18, Fig. 19. Confusion matrices of the training and testing model results
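A sketch of how such a confusion matrix can be produced with scikit-learn, thresholding the sigmoid output at 0.5 (the threshold is an assumption):

```python
from sklearn.metrics import confusion_matrix

# Turn predicted fraud probabilities into hard 0/1 labels.
Y_pred = (model.predict(X_test) > 0.5).astype(int).ravel()

# Rows are the actual classes, columns the predicted classes.
print(confusion_matrix(Y_test, Y_pred))
```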

The resulting model is a good one, because the proportion of correct classifications is very high. It is also important to note that all fraudulent transactions were correctly identified. In this way, even if certain payments are falsely categorized as fraudulent, a person can manually check them again to confirm or deny the model's result. From my point of view, it is much safer for the model to err by flagging some legitimate transactions as fraud than to consider fraudulent transactions as real, because in the latter case there would be either frauds that remain unidentified, or all transactions detected as real would have to be checked again by another method, as we could not be sure which of them might be fraudulent.
In order to obtain good results, a method was applied to balance the dataset, so that there is no disproportion between the number of fraudulent transactions and the number of real transactions. The sampling techniques used were undersampling and oversampling with the SMOTE technique, which creates a new synthetic point between two existing points. Another option could be to increase the number of layers, so that the model is more complex, or to train over more epochs.

8 Conclusions
The model obtained in this paper can be integrated into banks' mobile applications in order to identify fraudulent transactions that debit money from victims' credit accounts. The integration with Android and iOS mobile applications is done using the TensorFlow Lite framework: the already trained model can be converted into a buffer using the converter provided by the platform maintained by Google, and the compressed file obtained, with the .tflite extension, is then uploaded to the application on the mobile device.
The integration will be done as follows: the trained model will be embedded within the banking application, which already contains details regarding the transactions (the amount traded, the GPS location from which the transfer is made, the ID of the device from which the transfer is made, and much more). An alternative would be to create the neural network from scratch after storing the significant features offered by customers through their purchases, transactions and payments; once there is enough information, the actual training can begin, and then new predictions can easily be made.
If a transaction is considered suspicious, the bank will be able to manually verify it and, in case of irregularities or suspicions, will contact the cardholder to confirm the operation before debiting the amount. In case of a confirmed fraud, the card will be blocked, the amount of money will be blocked, and customers will be satisfied.
We can see that machine learning is an excellent tool in terms of information and data processing, but it does not have creativity, strategic thinking, intuition, value judgment or moral principles. That is why we will always need human resources.

References
[1] D. Li and Y. Du, Artificial Intelligence with Uncertainty. Taylor & Francis, LLC, 2008.
[2] Y. LeCun, B. Boser, J. S. Denker and D. Henderson, "Backpropagation Applied to Handwritten Zip Code Recognition", Neural Computation, vol. 1, Winter 1989, pp. 541-551.
[3] S. Hochreiter and J. Schmidhuber, "Long Short-Term Memory", Neural Computation, vol. 9, November 1997, pp. 1735-1780.
[4] A. Burkov, The Hundred-Page Machine Learning Book. Andriy Burkov, 2019.
[5] J. Howard and S. Gugger, Deep Learning for Coders with fastai and PyTorch. O'Reilly Media, Inc., July 2020.
[6] K. Murphy, Machine Learning: A Probabilistic Perspective. The MIT Press, 2012.
[7] A. I. Kokkinaki, "On Atypical Database Transactions: Identification of Probable Frauds using Machine Learning for User Profiling", Proc. of the IEEE Knowledge and Data Engineering Exchange Workshop, pp. 107-113, 1997.
[8] P. Bentley, "Evolutionary, my dear Watson: Investigating Committee-based Evolution of Fuzzy Rules for the Detection of Suspicious Insurance Claims", Proc. of GECCO 2000.
[9] W. Fan, M. Miller, S. Stolfo, W. Lee and P. Chan, "Using Artificial Anomalies to Detect Unknown and Known Network Intrusions", Proceedings of ICDM, pp. 123-248, 2001.
[10] R. Wheeler and S. Aitken, "Multiple Algorithms for Fraud Detection", Knowledge-Based Systems, vol. 13, pp. 93-99, 2000.
[11] R. Bolton and D. Hand, "Statistical Fraud Detection: A Review", Statistical Science, vol. 17, no. 3, pp. 235-249, 2002.
[12] J. Dorronsoro, F. Ginel, C. Sanchez and C. Cruz, "Neural Fraud Detection in Credit Card Operations", IEEE Transactions on Neural Networks, vol. 8, pp. 827-834, 1997.
[13] K. Leonard, "The development of a rule based expert system model for fraud alert in consumer credit", European Journal of Operational Research, vol. 80, pp. 350-356, 1995.
[14] K. Ezawa and S. Norton, "Constructing Bayesian Networks to Predict Uncollectible Telecommunications Accounts", IEEE Expert, October 1996, pp. 45-51.
[15] S. Maes, K. Tuyls, B. Vanschoenwinkel and B. Manderick, "Credit Card Fraud Detection using Bayesian and Neural Networks", Proc. of the 1st International NAISO Congress on Neuro Fuzzy Technologies, 2002.
[16] Consumer Sentinel Network, Data Book 2019, Federal Trade Commission, January 2020. Available: https://www.ftc.gov/system/files/documents/reports/consumer-sentinel-network-data-book-2019/consumer_sentinel_network_data_book_2019.pdf
[17] P. Vothihong, M. Czygan, I. Idris, M. V. Persson and L. F. Martinez, Python: End-to-end Data Analysis (Learning Path). Packt Publishing, May 2017.
[18] K. Potdar, T. Pardawala and C. Pai, "A Comparative Study of Categorical Variable Encoding Techniques for Neural Network Classifiers", International Journal of Computer Applications, vol. 175, pp. 7-9, 2017.
[19] M. Buckmann, "Deep Learning Model For Bank Crisis Prediction", 2020. Available: https://swisscognitive.ch/2020/03/30/deep-learning-model-for-bank-crisis-prediction/
[20] G. Hinton, N. Srivastava, A. Krizhevsky, I. Sutskever and R. Salakhutdinov, "Improving neural networks by preventing co-adaptation of feature detectors", 2012. Available: https://arxiv.org/abs/1207.0580
[21] A. Bhande, "What is underfitting and overfitting in machine learning and how to deal with it", 2018. Available: https://medium.com/greyatom/what-is-underfitting-and-overfitting-in-machine-learning-and-how-to-deal-with-it-6803a989c76
[22] J. Brownlee, "Better Deep Learning - Train Faster, Reduce Overfitting, and Make Better Predictions", 2018. Available: https://machinelearningmastery.com/better-deep-learning/
[23] D. P. Kingma and J. L. Ba, "Adam: A Method for Stochastic Optimization", published as a conference paper at ICLR 2015. Available: https://arxiv.org/pdf/1412.6980.pdf

Oona VOICAN graduated from the Faculty of Cybernetics, Statistics and Economic Informatics at the Bucharest University of Economic Studies, Economic Cybernetics specialization. She followed a master's degree in Business Analysis and Business Performance Control with the thesis Data mining methods used in the banking system. She worked for 15 years in the banking system and is now a senior adviser at the Ministry of Transportation, Infrastructure and Communications. Currently, she is a PhD student in the field of Economic Informatics. Her scientific fields of interest include Data Mining, Big Data, and Business Intelligence.
