
May 2021

Credit Card Fraud Detection using Machine Learning
Utkarsh Verma, Madhav Singh, Sumit Kumar Chaudhary, Faizal Khan
Final Year B.Tech, Information Technology
Harcourt Butler Technical University
Kanpur, India
vutkarsh174@gmail.com, singh.20madhav@gmail.com, sumitkumarchaudhary007@gmail.com, faizalkhan1234786@gmail.com

Abstract: With the sharp rise in online transactions, the use of credit cards for payment has also increased. This raises the likelihood of fraudulent transactions, which eventually lead to heavy financial losses. Banks and other financial institutions therefore support the development of credit card fraud detection applications. Transactions with a high suspicion of fraud can be identified by analyzing the behaviour of credit card customers through their previous transaction history. If the behaviour deviates from the patterns available in that history, there is a high chance that the transaction is fraudulent. Machine learning techniques are used extensively to detect such frauds. In this paper, we have used Local Outlier Factor, KNN and Random Forest techniques to detect frauds. The performance of the model is evaluated based on accuracy, sensitivity, precision and recall.

Keywords: Machine Learning; fraudulent; Credit Card; removing outliers (Local Outlier Factor & Isolation Forest); KNN technique; Random Forest technique

I. INTRODUCTION

Nowadays, credit card transactions are very common because they are easy and save time. But a major problem comes with them: fraudulent transactions. The detection of such transactions therefore becomes imperative.

Credit card fraud detection is a problem that has persisted for a long time because it is hard to solve. There are many issues associated with it. With the restricted amount of data available, it is difficult to find a pattern in the dataset. Also, a dataset can contain hundreds of thousands of entries with only a handful of fraudulent ones, which might themselves fit a pattern of legitimate behaviour. The problem also has many restrictions. Firstly, for security reasons, datasets are not easily obtainable by the public or, if available, are censored, making the results inaccessible. Because of this it is challenging to benchmark the models built. Secondly, the improvement of these methods is hampered by the fact that security concerns restrict the exchange of ideas and methods in credit card fraud detection. Also, the datasets are continuously evolving, so the behaviour of legitimate and fraudulent transactions keeps changing. With the massive boom in the field of machine learning, machine learning has been identified as a successful technique for the detection of fraudulent transactions. A huge amount of data is transferred during digital transactions, and each transaction yields a binary outcome: genuine or fraudulent. Within the sample training and testing datasets, features are constructed. These features include the age of the card holder, the value of the amount and the origin of the credit card. There are many features, and each contributes to a different extent towards the fraud probability. We have used the KNN and Random Forest algorithms on the total dataset to classify incoming transactions as genuine or fraudulent. The spending behaviour of a cardholder depends on the past history of transactions (on features such as location, daily expenses and transaction time). These can be compared with current transaction details to detect credit card frauds. Deviation from this behavioural pattern helps to detect fraudulent transactions accurately. Given such a deviation or shift in behavioural data, we used different machine learning techniques to detect the fraud. The model is thus used to identify whether a new or incoming transaction is fraudulent or genuine.

The main problem is that online payment does not even require the presence of the real card. Any person with the card details can commit such a crime. The card holder comes to know about the fraud only after the fraudulent transaction has occurred. The sole aim of this project is to develop a Credit Card Fraud Detection System using Machine Learning.

Unfortunately, for confidentiality reasons, the original features are not present and further background information about the data is also hidden. The dataset contains only numerical input variables, which are the result of a PCA transformation. Features V1,...,V28 are the principal components obtained with PCA. The features that have not been transformed using PCA are 'Time' and 'Amount'. Feature 'Class' is the predicted variable, which takes the value 1 in case of fraud and 0 otherwise. This way we can apply machine learning techniques to detect fraudulent transactions. In this paper, we present a solution to the problem defined above. So let us see what our project has to show. Just a little insight:

We have used various algorithms with slight modifications to achieve better accuracy than that reported in other research papers.

II. RELATED WORKS

Numerous papers focus on detecting fraudulent transactions using deep neural networks and other advanced concepts. However, these models are computationally expensive and perform better on larger datasets. They are also beyond the scope of our current knowledge base. These approaches may lead to great results, as we saw in some papers, but what if the same or even better results can be achieved with fewer resources? Our main goal is to show that simple machine learning algorithms can give decent results with appropriate preprocessing.

A. Credit Card Fraud Detection

This is not a standard research paper. It proposed an idea of how we should proceed with our project. It gave us an insight into the problem statement and suggested a model or architecture as the solution.

B. Credit Card Fraud Detection using Machine Learning and Data Science

This is a standard research paper that we obtained from the IEEE website. It provided valuable insight into how to analyze and then visualize the dataset. However, it used far fewer graphs than we did: only three graphs for visualization, which suggests that the dataset was not analyzed thoroughly. Apart from the common 1D, 2D and 3D scatter plots, we have also used histograms and box-and-whisker plots. We also incorporated statistical properties such as mean, variance and standard deviation for proper analysis of the dataset downloaded from Kaggle.

In addition, we performed Exploratory Data Analysis (a complete in-depth analysis and visualization) of the dataset. Here also, we plotted five graphs, so as to leave no scope for ambiguity or inconsistency.

C. Fraud Detection in Credit Card using Machine Learning Techniques

This paper did not use statistical properties (such as mean, median and standard deviation) in its analysis. It also did not plot graphs such as scatter plots, histograms or box-and-whisker plots for visualization. Hence, the analysis part was not performed well.

However, it did provide an idea of the algorithms to be used. The algorithms used in that paper gave lower accuracy, precision and recall than ours.

D. Early Prediction of Credit Card Fraud Detection using Isolation Forest Tree and Local Outlier Factor Machine Learning Algorithms

This paper provided the idea of cleaning the data in the project. However, the algorithms they used were not as effective as ours, because we used a higher number of neighbours for Local Outlier Factor and a higher estimator factor for Isolation Forest. The comparison of their accuracy with ours is shown as follows:

[Figure: bar chart of accuracy (%), "This Research Paper" vs. "Our Research Paper", for Local Outlier Factor and Isolation Forest Tree; axis range 99.58 to 99.78]

E. Credit Card Fraud Detection Using Machine Learning

The KNN algorithm in this paper used fewer neighbours for classification, which resulted in a poorer confusion matrix than ours. They also did not remove outliers properly from their dataset. We detected more true positives and true negatives and fewer false positives and false negatives, which resulted in better accuracy for our model. They used 4000 data instances while we used 5000. The comparison is given below:

Their research paper:

            Predicted
            0       1
Actual  0   3968    12
        1   4       8

Our research paper:

            Predicted
            0       1
Actual  0   4996    1
        1   1       2

A comparison of the respective accuracies is:

[Figure: bar chart of KNN accuracy (%), "This Research Paper" vs. "Our Research Paper"; axis range 99.4 to 100]

F. Credit Card Fraud Detection using Random Forest

This paper is not from a standard website. The confusion matrix reported in it is poorer than ours. This may be due to differences in the implementation (they might have used a low estimator factor for Random Forest). Their outlier removal algorithm might also not be as effective and accurate as ours.

A comparison of the confusion matrix of their algorithm and ours is given below:

[Figure: confusion matrix, this research paper]

[Figure: confusion matrix, our research paper]

A comparison of their respective accuracies is below:

[Figure: bar chart of Random Forest accuracy (%), "This Research Paper" vs. "Our Research Paper"; axis range 99.75 to 100]

G. Credit Card Fraud Detection Using Machine Learning

This is a standard research paper taken from the IEEE website. We used a different library than they did, which made our Isolation Forest confusion matrix better than theirs. Our outlier removal technique might also have been better than theirs. A comparison of the confusion matrix is given below:

    4997    0
    1       2

[Figure: bar chart of Random Forest accuracy (%), "This Research Paper" vs. "Our Research Paper"; axis range 99.93 to 99.99]

III. METHODOLOGY

The proposed approach uses simple machine learning algorithms to detect anomalous activities or points in the dataset, called outliers.

The basic architecture can be represented by the following diagram:

[Figure: basic architecture diagram]

Taking a more detailed view on a larger scale that incorporates real-life elements, the full architecture diagram is as follows:

[Figure: full architecture diagram]
The materials and methods that we have used in this project are:

A. Dataset

In this research the Credit Card Fraud Detection dataset was used, which can be downloaded from Kaggle.

The dataset contains 31 numerical features, of which 28 are named V1 to V28 to protect sensitive data. Since some of the input variables contain financial information, a PCA transformation was applied to them to keep the data anonymous and confidential. The remaining three features are Time, Amount and Class, which were not transformed. "Time" is the time elapsed between the first transaction and each other transaction in the dataset. "Amount" is the amount of the transaction made by credit card. "Class" is the label and takes only two values: 1 for a fraudulent transaction and 0 otherwise.

The dataset contains 284,807 transactions, of which 492 were frauds and the rest were genuine. These numbers show that the dataset is highly imbalanced: only 0.173% of transactions (492 out of 284,807) are labeled as frauds.

Since the class distribution plays an important role in model accuracy and precision, preprocessing of the data is crucial.

B. Visualization and Analysis

We plot different graphs to check for inconsistencies in the dataset and to visually comprehend it:

[Figure: screenshot of the statistical summary of the dataset]

The above screenshot is part of the statistical analysis that we did for our dataset. It states the mean, median and standard deviation of the dataset.

After this, we plot a histogram for every column to obtain a graphical representation of the dataset. This can be used to check whether there are missing values in the dataset, so that nothing hampers the process and the machine learning algorithms can process the dataset smoothly. This histogram shows that there is a very heavy overlap of genuine and fraud transactions throughout time, with no clear distinction.

[Figure: histogram]

From the above histogram, most transactions have an amount below 2500, and all fraud transactions have an amount below 2500. No fraud transaction has an amount greater than 2500.

[Figure: graph]

This graph shows that there are frauds only for transactions whose amount is less than 2500. As for the time, the fraud transactions are evenly distributed throughout. Nearly all transactions are very small and only a few are close to the maximum amount.

[Figure: box plot]

By looking at the above box plot, we can say that both fraud and genuine transactions occur throughout the whole time range with no distinction between them. Hence we need to use the amount as the approximate demarcation between fraudulent and genuine transactions.

There is one more box plot that we created. It is given as follows:

[Figure: box plot]

From this box plot, we can infer that no fraud transactions occur above a transaction amount of 3000. All fraud transactions have an amount below 3000. However, many transactions have an amount greater than 3000, and all of them are genuine.

After this, we performed Exploratory Data Analysis. Here, we plot different graphs to check for inconsistencies in the dataset and to visually comprehend it:

[Figure: graph of transaction amounts]

This graph represents the transaction amounts. Nearly all transactions are very small and only a few are close to the maximum amount.

[Figure: graph of the number of transactions per class]

This graph shows that the number of fraudulent transactions is much lower than the number of legitimate ones.

[Figure: graph of transaction times]

This graph shows the times at which transactions were made over a span of less than two days. It can be clearly observed that the fewest transactions were made at night and the most during the day.

In short, we have fully performed visualization and analysis of our dataset. The information obtained can be summarized as follows:

[Figure: listing of the dataset columns]

The 31 columns of our dataset are represented above.

Out of the total 284,807 data instances in our dataset, 284,315 belong to the genuine category (represented as 0) while 492 belong to the fraudulent category. Some other analyses are represented below:

[Figure: two screenshots of summary statistics for the Amount column]

The above two screenshots show the count, mean and standard deviation of the Amount column of our dataset for the legitimate and fraud classes. The statistics for the genuine class are shown in the former and those for the fraud class in the latter.

Our dataset has now been visualized, analyzed and processed. The Time and Amount columns are standardized and the Class column is taken out to ensure fair and proper evaluation of the model. The data is further refined by a few ML algorithms from libraries. The data is fit into a model and the outlier detection module is applied to it. We have used two algorithms for this: Local Outlier Factor and Isolation Forest.

C. Preprocessing

Feature selection is a general method that selects the variables that are most important for prediction in the given dataset. Carefully choosing appropriate features and removing the less important ones can reduce overfitting, improve accuracy and reduce training time. Visualization techniques, as seen above, can be useful in this process. Using this technique, we found out which features are most important. Furthermore, features that do not contribute to the total importance of 95% were removed.

Anomalous data instances, known as outliers, were then removed using two algorithms: the Local Outlier Factor algorithm and the Isolation Forest algorithm. After that, we analyzed the results to find the algorithm that works best for our problem. It was found that the Isolation Forest algorithm works better than the Local Outlier Factor algorithm. The two algorithms are discussed below:

1. Local Outlier Factor: This algorithm is available in sklearn (scikit-learn). The package includes methods and functions for classification, regression and outlier detection. It is free and open-source. It is built on NumPy, SciPy and matplotlib, which provide many useful tools for data analysis and machine learning, and it is designed to work with other modern machine learning libraries.

It is an outlier detection algorithm. 'Local Outlier Factor' refers to the anomaly score of an instance of the dataset. It measures the deviation of a data instance relative to its neighbours. More precisely, the local aspect is given by the k nearest neighbours. A distance function is used to estimate the closeness of a data point to its neighbours.

We have used the Google Colab platform, which we found to be the most efficient platform for execution after Jupyter Notebook. The algorithm is written as follows:

[Figure: screenshot of the Local Outlier Factor code]

The working of the algorithm can be represented by the graph below:

[Figure: graph illustrating the Local Outlier Factor algorithm]

By comparing the distance values of a data instance to those of its neighbours, one can find instances that deviate from their neighbours. These instances are highly anomalous and are called outliers. Because the dataset is large, we used only a small part of it to reduce the processing time.

The outcome of outlier removal on the fully pre-refined dataset is presented in the results part of this paper.

2. Isolation Forest Algorithm: This algorithm 'isolates' data instances by randomly selecting a column and then randomly selecting a split value between the maximum and minimum values of that column.

This random partitioning of features will produce shorter paths in trees for the anomalous data values
and distinguish them from the normal data. The algorithm recursively generates partitions of the dataset by randomly selecting a feature and then randomly selecting a split value for that feature. Anomalies need fewer random partitions to be isolated than normal data points. Therefore, the anomalies are the points that have a shorter path in the tree. Here, the path length is the number of edges traversed from the root node.

The algorithm can be written as:

[Figure: screenshot of the Isolation Forest code]

The working of the algorithm can be represented by the graph below:

[Figure: graph illustrating the Isolation Forest algorithm]

Random partitioning produces shorter paths for anomalies. When the forest of random trees produces shorter path lengths for some specific data instances, they are highly likely to be anomalies. When these anomalies are found, the model can be used to report them to the concerned authorities. For testing, we compare the respective results of these algorithms to determine their accuracy and precision.

IV. PROPOSED SYSTEM

The literature review shows the different complex algorithms used by researchers all over the world. Our motive is to use the simplest machine learning algorithms for credit card fraud detection and to show that the same or even better accuracy and precision can be achieved with them. So, here we have used two simple machine learning algorithms, K-nearest neighbour and Random Forest, to achieve better accuracy and precision than those used in the other research papers. We also compared the two algorithms, as shown in the result section.

1. K-Nearest Neighbour Algorithm: K-nearest neighbour is a distance-based, supervised machine learning technique. It is a simple yet highly accurate classifier. A new incoming data instance is classified based on the majority prediction of its K nearest neighbours.

The following three aspects are essential to the performance of the classifier:
a. The distance function used to locate the nearest neighbours.
b. The method used to determine the category from the k nearest neighbours.
c. The value of 'k' (number of neighbours) used to classify new data instances.

Among competing credit card fraud detection methods, KNN almost always achieves high performance in evaluation. It does not assume anything about the training dataset; it only needs a function to calculate the distance between two points. In KNN, we classify an incoming transaction by finding its nearest neighbours. If the majority of the nearest neighbours are fraudulent, the transaction is assigned to the fraud class. The value of K is taken as a small odd number (typically 1, 3 or 5). A larger value of K reduces the effect of noise in the dataset. Different distance functions can be used. For continuous attributes, Euclidean distance is the usual choice. For categorical attributes, a simple matching coefficient is commonly used. For multivariate problems, the distance is computed for each attribute and then combined. The distance metric needs to be tuned for better performance. This technique works best with a balanced training dataset (equal proportions of genuine and fraudulent transactions).

A rough algorithm for KNN can be given as:
a. Divide the dataset into two parts for training and testing.
b. Select a value of k.
c. Determine the distance function to be used.
d. Choose a sample from the testing dataset that needs to be classified and compute its distance to all the training samples.
e. Sort the distances obtained and take the k nearest data samples.
f. Assign the sample to the class held by the majority of these k neighbours.
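The rough KNN procedure above can be illustrated with a minimal sketch in plain Python. This is not the code used in the project (which is shown only as a screenshot); the function names `euclidean` and `knn_predict` and the six toy training points are hypothetical and chosen only for illustration.

```python
import math
from collections import Counter


def euclidean(a, b):
    """Euclidean distance between two equal-length numeric tuples."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def knn_predict(train_X, train_y, sample, k=3, distance=euclidean):
    """Classify `sample` by majority vote among its k nearest training points."""
    # Compute the distance from the sample to every training sample.
    distances = [(distance(x, sample), label) for x, label in zip(train_X, train_y)]
    # Sort by distance and keep the labels of the k nearest samples.
    distances.sort(key=lambda pair: pair[0])
    nearest_labels = [label for _, label in distances[:k]]
    # Majority vote among the k nearest neighbours.
    return Counter(nearest_labels).most_common(1)[0][0]


# Hypothetical toy data: class 0 (genuine) near (1, 1), class 1 (fraud) near (5, 5).
train_X = [(1.0, 1.0), (1.2, 0.9), (0.8, 1.1), (5.0, 5.0), (5.2, 4.9), (4.8, 5.1)]
train_y = [0, 0, 0, 1, 1, 1]

print(knn_predict(train_X, train_y, (1.1, 1.0), k=3))  # 0
print(knn_predict(train_X, train_y, (5.1, 5.0), k=3))  # 1
```

Using an odd k with two classes avoids tied votes, which is one reason small odd values such as 1, 3 or 5 are typical.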

2. Random Forest Algorithm: This is a supervised machine learning algorithm based on ensemble learning. Ensemble learning is a form of machine learning in which different algorithms, or the same algorithm multiple times, are combined to build a more accurate model. Random Forest combines multiple decision trees, resulting in a group of trees (a forest), hence the name. It is better than a single decision tree because it reduces overfitting by averaging the results. It can be used for both regression and classification problems.

[Figure: illustration of the Random Forest algorithm]

A rough algorithm for Random Forest can be given as:
a. Pick N random data instances from the dataset.
b. Build a decision tree based on these N instances.
c. Pick the number of trees that we want in our algorithm and repeat steps a and b.
d. For a classification problem, each tree in the forest predicts the class to which the new incoming data instance belongs.
e. Finally, the new incoming data instance is assigned to the class that wins the majority vote.

Advantages of using the Random Forest algorithm:
a. The algorithm has very low overall bias, since there are multiple trees and each tree is trained on a different subset of the data.
b. The algorithm is very stable. Even if a new data instance is introduced in the dataset, the algorithm is largely unaffected, because the new instance may impact one tree but is very unlikely to affect all the trees.
c. The algorithm works well when the data has both categorical and numerical features. It also works well when the dataset has some missing values.

To sum up, we first performed Exploratory Data Analysis on the full dataset. This part is also called visualization of the dataset and is done by representing the dataset with various graphs. Then, we removed the outliers using two algorithms: Local Outlier Factor and Isolation Forest. We compared the two and found that Isolation Forest gives better results. After that, we used the KNN and Random Forest algorithms to train the model and to predict whether a transaction is fraudulent or not. We have also compared the two algorithms, as shown in the result section. We evaluated each algorithm using its confusion matrix and computed accuracy, precision, recall, etc. These are presented in the result section of this research paper.

V. EVALUATION OF PROJECT

To evaluate the algorithms, we draw the confusion matrix for each of the two algorithms used. A confusion matrix contains information about the actual and predicted classifications made by a classification model. The performance of such models is commonly evaluated using the data in the matrix.

After that, we calculate the accuracy, precision, etc. of the algorithms from the confusion matrix.

[Figure: layout of the confusion matrix]

True Positive (TP) is the number of examples from the positive class that are correctly predicted.
False Positive (FP) is the number of examples from the negative class that are wrongly predicted as positive.
False Negative (FN) is the number of examples from the positive class that are wrongly predicted as negative.
True Negative (TN) is the number of examples from the negative class that are correctly predicted.

Four basic metrics are used in the evaluation of the model: True Positive Rate (TPR), True Negative Rate (TNR), False Positive Rate (FPR) and False Negative Rate (FNR).

TPR is also called sensitivity, hit rate and recall. TNR is
also called specificity.

The TPR and TNR evaluate the effectiveness of a classifier for each class in the binary classification problem. TPR is the proportion of examples belonging to the positive class that were correctly predicted as positive. TNR is the proportion of examples belonging to the negative class that were correctly predicted as negative.

The total numbers of positive and negative class cases under test are represented by P and N respectively:
P = TP + FN
N = FP + TN

With these, the four rates are:
TPR = TP / P
TNR = TN / N
FPR = FP / N
FNR = FN / P

The performance of the model is evaluated on the basis of accuracy, precision, specificity and sensitivity.

Accuracy is the percentage of correct predictions made by a classifier: Accuracy = (TP + TN) / (P + N).
Sensitivity (Recall) is the proportion of positive (fraud) cases that are correctly classified: Sensitivity = TP / P.
Specificity is the proportion of negative (legitimate) cases that are correctly classified: Specificity = TN / N.
Precision is the proportion of cases classified as fraud (positive) that are actually fraud: Precision = TP / (TP + FP).

VI. RESULTS

1. We performed the visualization of our dataset by representing it through various graphs. We were able to make the following observations:

[Figure: two plots, including a 2-D scatter plot]

Observation: From the above two plots it is clear that there are frauds only in transactions with an amount below 2500.

2. Evaluation of the algorithms used for outlier removal (Local Outlier Factor and Isolation Forest algorithms)

[Figure: results of the two outlier removal algorithms]

As is clear, the Isolation Forest algorithm gives better results for our model.

3. After removing outliers

[Figure: dataset after removing outliers]

4. Evaluation of K-Nearest Neighbour algorithm:

[Figure: screenshot of the KNN evaluation output]

Confusion Matrix: From the above screenshot,

            Predicted
            0       1
Actual  0   4996    1
        1   1       2

we can see that the value of "True Negative" is 4996, which means that out of 4997 points which belong to class 0, 4996 points are correctly predicted as 0. Furthermore, from the same confusion matrix, it can be seen that the "True Positive" is 2, which means that out of 3 points which belong to class 1, 2 points are correctly classified as 1.

The final result of KNN algorithm is as follows:

• accuracy: 99.960%
• precision: 66.667%
• recall: 66.667%

5. Evaluation of Random Forest algorithm:

[Figure: screenshot of the Random Forest evaluation output]

Confusion Matrix: From the above screenshot,

            Predicted
            0       1
Actual  0   4997    0
        1   1       2

The final result of Random Forest algorithm is as follows:

• accuracy: 99.980%
• precision: 100.000%
• recall: 66.667%
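The reported figures can be reproduced directly from the two confusion matrices. The sketch below is only an illustration: the helper name `metrics` is hypothetical, and it assumes the matrices are read as rows = actual class, columns = predicted class, with class 1 (fraud) as the positive class.

```python
def metrics(tn, fp, fn, tp):
    """Compute evaluation metrics from the four cells of a binary confusion matrix."""
    p = tp + fn  # total positive (fraud) cases
    n = fp + tn  # total negative (genuine) cases
    return {
        "accuracy": (tp + tn) / (p + n),
        "precision": tp / (tp + fp),  # undefined if nothing is predicted positive
        "recall": tp / p,             # TPR, sensitivity
        "specificity": tn / n,        # TNR
        "fpr": fp / n,
        "fnr": fn / p,
    }


# Cell values taken from the confusion matrices above.
knn = metrics(tn=4996, fp=1, fn=1, tp=2)
rf = metrics(tn=4997, fp=0, fn=1, tp=2)

for name, m in (("KNN", knn), ("Random Forest", rf)):
    print(f"{name}: accuracy={m['accuracy']:.3%} "
          f"precision={m['precision']:.3%} recall={m['recall']:.3%}")
# KNN: accuracy=99.960% precision=66.667% recall=66.667%
# Random Forest: accuracy=99.980% precision=100.000% recall=66.667%
```

With only 3 fraud cases among the 5000 test instances, accuracy is dominated by the genuine class, so precision and recall are the more informative figures here.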

VII. CONCLUSIONS

As we saw above, the Isolation Forest algorithm gave better results than the Local Outlier Factor algorithm. This means that the Isolation Forest algorithm detected more outliers for the same number of dataset instances. These results were also better than those reported in other research papers because of the high number of neighbours and the high estimator factor that we used.

As far as the learning algorithms are concerned, we provided both KNN and Random Forest with a total of 25000 data instances. Of these, 20000 were used to train the model while the remaining 5000 were used for testing. Of the 5000 testing instances, 4997 belong to class 0 and 3 belong to class 1. The confusion matrix shows that our model performed well despite the very imbalanced dataset. The accuracy given by KNN is 99.960%.

The Random Forest confusion matrix shows that it was more successful in its predictions. The accuracy given by Random Forest is 99.980%, one fewer misclassification than KNN on the 5000 test instances. Thus, Random Forest gives more accurate predictions than KNN and also requires less time for both the training and testing phases.

KNN will give better results with a larger amount of training data, but the time taken for testing will then increase. The use of more complex preprocessing techniques on the dataset will also help.

VIII. FUTURE ENHANCEMENTS

While one hundred percent accuracy in fraud detection could not be achieved, we built a model that can, with enough time and data, get closer to that goal. There is always scope for improvement, and that is true for our project as well.

The project allows multiple algorithms to be integrated and their outcomes combined to increase the final accuracy.

This model can be improved further with the use of more modern algorithms. However, the output of these algorithms needs to be in the same format as that of the others. Once this condition is fulfilled, the algorithms are simple to integrate. This gives the project a great deal of versatility.

The dataset can also be improved. As already mentioned, the accuracy of the algorithms increases with the size of the dataset. Hence, more data instances are likely to make the model more accurate in detecting frauds and in reducing the number of false positives. However, this requires support from banks and other institutions.

IX. ACKNOWLEDGEMENT

This project would not have been possible without the support of Dr. Anita Yadav, Associate Professor, Computer Science Department, Harcourt Butler Technical University, Kanpur. We are extremely grateful to her.

X. REFERENCES

[1] Munira Ansari, Hashim Malik, Siddhesh Jadhav, Zaiyyan Khan, "Credit Card Fraud Detection", International Journal of Engineering Research & Technology (IJERT), NREST - 2021 Conference Proceedings.
[2] Mr. S P Maniraj, Aditya Saini, Shadab Ahmed, "Credit Card Fraud Detection using Machine Learning and Data Science", Sixth International Conference on Intelligent Systems Design and Engineering Applications (2019), IEEE.
[3] Mr. Manohar, Arvind Bedi, Shashank Kumar, "Fraud Detection in Credit Card using Machine Learning Techniques", International Research Journal of Engineering and Technology (IRJET), Volume: 07 Issue: 04 | Apr 2020.
[4] Mr. Arjun K P, Subhash Singh Negi, "Early Prediction of Credit Card Fraud Detection using Isolation Forest Tree and Local Outlier Factor Machine Learning Algorithms".
[5] Rahul Powar, Rohan Dawkhar, Pratichi, "Credit Card Fraud Detection using Machine Learning", International Journal of Advance Scientific Research and Engineering Trends, Vol. 5 Iss. 9, September 2020, Springer.
[6] Mrs. Indira, Devi Meenakshi, Gayathri, "Credit Card Fraud Detection using Random Forest", International Research Journal of Engineering and Technology (IRJET), Volume: 06 Issue: 03 | Mar 2019.
[7] Ruttala Sailusha, R. Ramesh, V. Gnaneswar, "Credit Card Fraud Detection Using Machine Learning", Proceedings of the International Conference on Intelligent Computing and Control Systems (ICICCS 2020), IEEE.
