We have used various algorithms with slight modifications to achieve better accuracy than that reported in other research papers.

II. RELATED WORKS

Numerous papers focus on detecting fraudulent transactions using deep neural networks and other advanced concepts. However, these models are computationally expensive and perform better on larger datasets. They are also beyond the scope of our current knowledge base. These approaches may lead to great results, as we saw in some papers, but what if the same results, or even better ones, can be achieved with fewer resources? Our main goal is to show that different, simple machine learning algorithms can give decent results with appropriate preprocessing.

A. Credit Card Fraud Detection

This research paper is not a standard research paper. It proposed an idea as to how we should proceed with our project. It provided us with an insight into the problem statement and suggested a model or architecture as the solution.

B. Credit Card Fraud Detection using Machine Learning and Data Science

This research paper is a standard research paper that we obtained from the IEEE website. It provided valuable insight into how to analyze and then visualize the dataset. However, the graphs used in this research paper were far fewer than ours: only three graphs were used for visualization, so the dataset was not analyzed thoroughly. Apart from the common 1D, 2D and 3D scatter plots, we have also used histograms and box-and-whisker plots for proper analysis. We also incorporated statistical properties such as mean, variance and standard deviation for proper analysis of the dataset downloaded from Kaggle.

[Figure: Accuracy comparison for the Local Outlier Factor and Isolation Forest Tree algorithms, this research paper vs. our research paper.]

E. Credit Card Fraud Detection Using Machine Learning

The KNN algorithm used by this research paper used a smaller number of neighbors for classification, which resulted in a poorer confusion matrix than ours. They also did not remove outliers properly from their dataset. We detected more true positives and true negatives and fewer false positives and false negatives, which resulted in better accuracy for our model. They used 4000 data instances to train their model while we used 5000. The comparison is given below:

[Table: Side-by-side confusion matrices (Actual vs. Predicted, classes 0 and 1) of their model and ours.]
A comparison of the confusion matrices of their algorithm and ours is given below:

[Figures: Comparison of Random Forest results, this research paper vs. our research paper.]

III. METHODOLOGY

When we take a bird's eye view on a larger scale by incorporating real-life elements, the full architecture diagram can be given as follows:

[Figure: Full architecture diagram of the proposed system.]
The materials and methods that we have used in this project are:

A. Dataset

In this research the Credit Card Fraud Detection dataset was used, which can be downloaded from Kaggle.

The dataset contains 31 numerical features, of which 28 are named V1-V28 to protect sensitive data. Since some of the input variables contain financial information, a PCA transformation of these input variables was performed in order to keep the data anonymous and confidential. The remaining three features are Time, Amount and Class; these features were not transformed. The feature "Time" gives the time gap between the first transaction and every other transaction in the dataset. The feature "Amount" is the amount of the transaction made by credit card. The feature "Class" represents the label and takes only two values: 1 in case of a fraudulent transaction and 0 otherwise.

The dataset contains 284,807 transactions, of which 492 were frauds and the rest were genuine. Considering these numbers, we can see that the dataset is highly imbalanced: only 0.173% of transactions are labeled as frauds. Since the distribution ratio of the classes plays an important role in model accuracy and precision, preprocessing of the data is crucial.

After this, we go on to plot a histogram for every column. This is done to obtain a graphical representation of the dataset and to check whether there are missing values, so that nothing hampers the process and the machine learning algorithms can handle the dataset smoothly. The histogram of transaction times shows a very heavy overlap of genuine and fraud transactions throughout time, with no clear distinction between them.

From the histogram of transaction amounts, most transactions have an amount of less than 2500, and all of the fraud transactions have an amount of less than 2500; there is no fraud transaction with an amount greater than 2500. As for time, the fraud transactions are evenly distributed. Nearly all transactions are very small and only a limited few are close to the maximum transaction amount.
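As a concrete illustration of this step, the following is a minimal sketch (not the authors' original code) of how the Kaggle dataset can be loaded and inspected with pandas and matplotlib; the file name creditcard.csv and the local path are assumptions.

    # Minimal sketch (assumed): load the Kaggle credit card fraud dataset and
    # reproduce the basic checks described above.
    import pandas as pd
    import matplotlib.pyplot as plt

    df = pd.read_csv("creditcard.csv")   # assumed local path to the Kaggle file

    # Class balance: 0 = genuine, 1 = fraud (expected to be roughly 0.17% fraud).
    counts = df["Class"].value_counts()
    print(counts)
    print("Fraud ratio: {:.3%}".format(counts[1] / len(df)))

    # Missing-value check and a histogram for every column.
    print(df.isnull().sum())
    df.hist(figsize=(20, 20), bins=50)
    plt.tight_layout()
    plt.show()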
This plotted graph shows the times at which transactions were made over a span of less than two days. It can be clearly observed from the graph that the lowest number of transactions was made at night and the highest during the day.

From the box plot, we can easily infer that no fraud transactions occur above a transaction amount of 3000. All of the fraud transactions have a transaction amount of less than 3000. However, there are many transactions with an amount greater than 3000, and all of them are genuine.

After this, we performed Exploratory Data Analysis. Here, we plot different graphs to check for inconsistencies in the dataset and to visually comprehend it:
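For reference, a minimal sketch (assumed, not the authors' code) of the kind of box-and-whisker plot the inference above is drawn from:

    # Box plot of transaction amount per class (assumed plotting code).
    import pandas as pd
    import matplotlib.pyplot as plt

    df = pd.read_csv("creditcard.csv")   # assumed path

    df.boxplot(column="Amount", by="Class", figsize=(6, 6))
    plt.title("Transaction amount by class (0 = genuine, 1 = fraud)")
    plt.suptitle("")
    plt.show()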
We have used the Google Colab platform. It is the most efficient platform for execution after the Jupyter Notebook platform. The algorithm is written as follows:

[Figure: Algorithm listing as written on Google Colab.]
C. Preprocessing

Feature selection is a universal method which selects the variables that are most important for the prediction in the given dataset. Carefully choosing appropriate features and removing the less important ones can reduce overfitting, improve accuracy and reduce training time. Visualization techniques, as seen above, can be useful in this process. Using this technique, we found out which features are most important. Furthermore, features that do not contribute to a total importance of 95% were removed. These are known as outliers.
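The paper does not give its exact code for this step; the following is a minimal sketch of one common way to keep only the features that make up 95% of the total importance, using a random forest's importance scores. The model, threshold handling and file path are assumptions.

    # Minimal sketch (assumed approach): rank features by importance and keep
    # those covering 95% of the cumulative importance.
    import pandas as pd
    from sklearn.ensemble import RandomForestClassifier

    df = pd.read_csv("creditcard.csv")                  # assumed path
    X, y = df.drop(columns=["Class"]), df["Class"]

    model = RandomForestClassifier(n_estimators=100, random_state=42)
    model.fit(X, y)                                     # may take a while on the full dataset

    importances = pd.Series(model.feature_importances_, index=X.columns)
    importances = importances.sort_values(ascending=False)
    cumulative = importances.cumsum()

    # Keep the features whose cumulative importance stays within 95%.
    selected = importances[cumulative <= 0.95].index.tolist()
    print("Selected features:", selected)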
The outliers' removal was done using two algorithms: the Local Outlier Factor algorithm and the Isolation Forest algorithm. After that, we analyzed the results to find the algorithm that works best for our problem. It was found that the Isolation Forest algorithm works better than the Local Outlier Factor algorithm. The two algorithms are discussed below:
1. Local Outlier Factor: This algorithm is present in sklearn. The module in the sklearn package includes methods and functions that can be used in classification, regression and outlier detection. This Python module is completely free and open-source. It is built using the NumPy, SciPy and matplotlib modules, which provide many creative and useful tools for data analysis and other machine learning tasks. It features various machine learning algorithms and is made to work with the different modern, complex and useful libraries in Machine Learning.

It is an outlier detection algorithm. The 'Local Outlier Factor' refers to the anomaly score of an instance of the dataset. It measures the deviation of a sample data instance as compared to its neighbours. To be clearer, the local aspect is given by the k-nearest neighbours, and a distance function is used to estimate the closeness of a data point to its neighbours.

The working of the algorithm can be represented by the graph below:

[Figure: Working of the Local Outlier Factor algorithm.]

By comparing the distance values of a data instance or sample to those of its neighbours, one can find instances that deviate from their neighbours. These instances are highly anomalous and are called outliers. Because the dataset is massive, we used only a small part of it in our process to reduce the processing time of the project. The outcome of the removal of outliers on the completely pre-refined dataset was also found and is presented in the results section of this paper.
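A minimal sketch of this step with sklearn's LocalOutlierFactor is given below; the sample size, n_neighbors and contamination values are assumptions, not the paper's reported settings.

    # Minimal sketch (assumed): outlier removal with LocalOutlierFactor on a
    # small sample of the dataset, as described above.
    import pandas as pd
    from sklearn.neighbors import LocalOutlierFactor

    df = pd.read_csv("creditcard.csv").sample(n=25000, random_state=42)  # assumed sample size
    X = df.drop(columns=["Class"])

    lof = LocalOutlierFactor(n_neighbors=20, contamination=0.002)
    labels = lof.fit_predict(X)          # -1 marks outliers, 1 marks inliers

    cleaned = df[labels == 1]
    print("Removed", (labels == -1).sum(), "outliers out of", len(df), "samples")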
2. Isolation Forest Algorithm: This algorithm 'isolates' data instances by randomly selecting a column and then randomly selecting a split value between the maximum and minimum values of the selected feature or column.
This random partitioning of features produces shorter paths in the trees for the anomalous data values and distinguishes them from the normal set of the data. The algorithm recursively generates partitions on the dataset by randomly selecting a feature and then randomly selecting a split value for that feature. Arguably, the anomalies need fewer random partitions to be isolated compared to the normal data points in the dataset. Therefore, the anomalies will be the points that have a shorter path in the tree, where the path length is the number of edges traversed from the root node. The algorithm can be written as:

[Figure: Isolation Forest algorithm listing.]
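Since the original listing is given as a figure, the following is a minimal sketch of the same idea using sklearn's IsolationForest; n_estimators, contamination and the sample size are assumptions.

    # Minimal sketch (assumed, not the authors' original listing): outlier removal
    # with IsolationForest on the same kind of feature matrix as above.
    import pandas as pd
    from sklearn.ensemble import IsolationForest

    df = pd.read_csv("creditcard.csv").sample(n=25000, random_state=42)  # assumed sample size
    X = df.drop(columns=["Class"])

    iso = IsolationForest(n_estimators=100, contamination=0.002, random_state=42)
    labels = iso.fit_predict(X)          # -1 marks anomalies, 1 marks normal points

    cleaned = df[labels == 1]
    print("Isolation Forest removed", (labels == -1).sum(), "outliers")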
IV. PROPOSED SYSTEM

The literature review clearly shows the different complex algorithms used by researchers from all over the world. Our motive is to use the simplest and easiest machine learning algorithms for Credit Card Fraud Detection and prove that the same or even better accuracy and precision can be achieved using them. So, here we have used two simple machine learning algorithms, K-nearest neighbour and Isolation forest, to achieve better accuracy and precision than the ones used in the other research papers. We also compared the two algorithms, as shown in the results section.
1. K-Nearest Neighbour Algorithm: The concept of K-nearest neighbour is a distance-based machine learning technique. It is a supervised learning algorithm. It is not only one of the simplest but also a highly accurate classifier. In this algorithm, a new incoming data instance or example is classified based on the majority prediction of its K nearest neighbours.

The below three aspects are essential to the performance of this classifier:

a. The distance function used to locate the nearest neighbours.
b. The method used to find out the category of the k nearest neighbours.
c. The value of 'k' (number of neighbours) used for the classification of the new data instances.

Amongst all the competitive credit card fraud detection methods, KNN almost always secures high performance in evaluation. The best part is that it does not even assume anything about the training dataset. It only needs a function to calculate the distance between two points. In KNN, we classify any incoming transaction by finding the nearest neighbour to the new incoming transaction. If the nearest neighbour turns out to be fraudulent, then the transaction is assigned the fraud class.

The working of the algorithm can be represented by the graph below:

[Figure: Working of the K-Nearest Neighbour algorithm.]

The value of K is taken as a small and odd number (typically 1, 3 or 5). A larger value of K reduces the effect of a noisy dataset. For this algorithm, there are different distance functions to find the distances. For continuous or regression problems, Euclidean distance is the best fit for the distance function. For classification problems, a simple matching coefficient is commonly used. For multivariate problems, the distance is found for each attribute and combined afterwards. We need to optimize the distance metric for better performance. This technique requires a balanced dataset for training (an equal proportion of genuine and fraudulent transactions). It is a good technique that can be followed and trusted.

A rough algorithm for KNN can be given as:

a. Divide the dataset into two parts, for training and testing.
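As an illustration of this procedure, a minimal sketch (not the authors' exact code) using sklearn's KNeighborsClassifier is given below. The 20,000/5,000 split follows the numbers reported in the conclusions, while k = 3 and the sampling strategy are assumptions.

    # Minimal sketch (assumed): train/test split and a KNN classifier.
    import pandas as pd
    from sklearn.model_selection import train_test_split
    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.metrics import accuracy_score

    df = pd.read_csv("creditcard.csv").sample(n=25000, random_state=42)  # assumed subset
    X, y = df.drop(columns=["Class"]), df["Class"]

    X_train, X_test, y_train, y_test = train_test_split(
        X, y, train_size=20000, test_size=5000, random_state=42, stratify=y)

    knn = KNeighborsClassifier(n_neighbors=3)   # assumed k
    knn.fit(X_train, y_train)
    print("KNN accuracy:", accuracy_score(y_test, knn.predict(X_test)))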
2. Random Forest Algorithm: This is a supervised machine learning algorithm. Ensemble learning is a form of machine learning where different types of algorithms, or the same algorithm multiple times, are combined to devise a more accurate model. This algorithm combines multiple decision trees, resulting in a group of trees called a forest, which is the reason for its name. It is better than a single decision tree because it reduces over-fitting by averaging the results. It can be used for both regression and classification problems.

This algorithm also works well when we have both categorical and continuous (regression) features, and the random forest algorithm works well even when the dataset has some missing values.
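A minimal sketch (assumed, not the authors' exact code) of a Random Forest classifier on the same kind of 20,000/5,000 split described in the conclusions; n_estimators and the sampling strategy are assumptions.

    # Minimal sketch (assumed): Random Forest classifier on the same split.
    import pandas as pd
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import train_test_split
    from sklearn.metrics import accuracy_score

    df = pd.read_csv("creditcard.csv").sample(n=25000, random_state=42)  # assumed subset
    X, y = df.drop(columns=["Class"]), df["Class"]
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, train_size=20000, test_size=5000, random_state=42, stratify=y)

    rf = RandomForestClassifier(n_estimators=100, random_state=42)       # assumed settings
    rf.fit(X_train, y_train)
    print("Random Forest accuracy:", accuracy_score(y_test, rf.predict(X_test)))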
So to sum up: first, we performed Exploratory Data Analysis on the full dataset. This part is also called visualization of the dataset and is done by representing the dataset using various graphs. Then, we removed the outliers using two algorithms, Local Outlier Factor and Isolation Forest; we also compared the two and found that Isolation Forest gives the better result. After that, we used the KNN and Random Forest algorithms to train the model and to predict whether a transaction is fraudulent or not. We have also compared these two algorithms, as shown in the results section. We evaluated each algorithm using the confusion matrix and found the accuracy, precision, recall, etc.; these are presented in the results section of this paper.
V. EVALUATION OF PROJECT

For the evaluation of the algorithms, we draw the confusion matrix for each of the two algorithms used. A confusion matrix contains information about the actual and predicted classifications made by a classification model. The performance of such models is commonly evaluated using the data in the matrix.

After that, we calculate the accuracy, precision, etc. for the algorithms from the confusion matrix.
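A minimal sketch (assumed, not the authors' exact code) of how the confusion matrix and these metrics can be computed with sklearn; the small y_true/y_pred lists are placeholders standing in for the true and predicted labels of either classifier.

    # Minimal sketch (assumed): confusion matrix and derived metrics.
    from sklearn.metrics import confusion_matrix, accuracy_score, precision_score, recall_score

    # Placeholder labels; in practice these come from the fitted classifiers.
    y_true = [0, 0, 0, 1, 1, 0]
    y_pred = [0, 0, 1, 1, 0, 0]

    tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
    print("TN, FP, FN, TP:", tn, fp, fn, tp)
    print("Accuracy :", accuracy_score(y_true, y_pred))
    print("Precision:", precision_score(y_true, y_pred))
    print("Recall   :", recall_score(y_true, y_pred))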
The TNR is also called specificity. The TPR and TNR evaluate the effectiveness of a classifier for each class in the binary classification problem. TPR is the proportion of examples belonging to the positive class which were correctly predicted as positive. TNR is a measure of how well a binary classification test correctly identifies the negative class.
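In terms of the confusion-matrix counts, these rates are (standard definitions, stated here for reference):

    TPR = TP / (TP + FN)        TNR = TN / (TN + FP)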
VI. RESULTS

1. We performed the visualization of our dataset by representing it through various graphs. We were able to make the following observations:
From the confusion matrix of the KNN algorithm, we can see that the value of "True Negative" is 4996, which means that out of the 4997 points belonging to class 0, 4996 points are correctly predicted as 0. Furthermore, from the same confusion matrix, it can be seen that the "True Positive" value is 2, which means that out of the 3 points belonging to class 1, 2 points are correctly classified as 1.

Random Forest:
                     Predicted
                      0      1
    Actual   0     4997      0
             1        1      2

KNN:
                     Predicted
                      0      1
    Actual   0     4996      1
             1        1      2
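As a check, using accuracy = (TN + TP) / total and assuming the labelling of the two matrices above, these matrices reproduce the accuracies quoted in the conclusions:

    Random Forest: (4997 + 2) / 5000 = 0.9998 = 99.98%
    KNN:           (4996 + 2) / 5000 = 0.9996 = 99.96%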
VII. CONCLUSIONS

As we saw above, the Isolation Forest algorithm gave better results than the Local Outlier Factor algorithm. This means that the Isolation Forest algorithm detected more outliers for the same number of dataset instances. These results were also far better than those reported in other research papers because of the higher number of neighbours and the value of the estimator parameter used by us.

As far as the learning algorithms are concerned, we provided both the KNN and Random Forest algorithms with a total of 25,000 data instances. Of these, 20,000 instances were used to train the model while the remaining 5,000 instances were used for testing. For KNN, out of the 5,000 testing instances, 4,997 points belong to class 0 and 3 points belong to class 1. The confusion matrix clearly shows that our model performed well despite the very imbalanced dataset. The accuracy given by KNN is 99.960%.

The Random Forest's confusion matrix also shows that it has been more successful in its predictions. The accuracy given by Random Forest is 99.980%. Thus, Random Forest's performance is better than KNN's.

Thus, Random Forest gives more accurate predictions and also requires less time for both the training and testing phases.

KNN will give better results with a larger amount of training data, but then the time taken to run on the testing dataset will increase. The use of more complex preprocessing techniques on the dataset will also help.

IX. ACKNOWLEDGEMENT

X. REFERENCES

[1] Munira Ansari, Hashim Malik, Siddhesh Jadhav, Zaiyyan Khan, "Credit Card Fraud Detection", International Journal of Engineering Research & Technology (IJERT), NREST - 2021 Conference Proceedings.
[2] S P Maniraj, Aditya Saini, Shadab Ahmed, "Credit Card Fraud Detection using Machine Learning and Data Science", Sixth International Conference on Intelligent Systems Design and Engineering Applications (2019), IEEE.
[3] Manohar, Arvind Bedi, Shashank Kumar, "Fraud Detection in Credit Card using Machine Learning Techniques", International Research Journal of Engineering and Technology (IRJET), Volume 07, Issue 04, Apr 2020.
[4] Arjun K P, Subhash Singh Negi, "Early Prediction of Credit Card Fraud Detection using Isolation Forest Tree and Local Outlier Factor Machine Learning Algorithms".
[5] Rahul Powar, Rohan Dawkhar, Pratichi, "Credit Card Fraud Detection using Machine Learning", International Journal of Advance Scientific Research and Engineering Trends, Vol. 5, Iss. 9, September 2020, Springer.
[6] Indira, Devi Meenakshi, Gayathri, "Credit Card Fraud Detection using Random Forest", International Research Journal of Engineering and Technology (IRJET), Volume 06, Issue 03, Mar 2019.
[7] Ruttala Sailusha, R. Ramesh, V. Gnaneswar, "Credit Card Fraud Detection Using Machine Learning", Proceedings of the International Conference on Intelligent Computing and Control Systems (ICICCS 2020), IEEE.