
Credit Card Fraud Detection using Machine Learning

A Project Report Submitted


In Partial Fulfillment of the Requirements
For the Award of the Degree of
Bachelors of Technology
In
Information Technology
By
Utkarsh Verma
(Enrollment No: HBTU170103048)
Madhav Singh
(Enrollment No: HBTU170108016)
Sumit Kr Chaudhary
(Enrollment No: HBTU170101028)
Faizal Khan
(Enrollment No: HBTU170108011)
Under the supervision of
Dr. Anita Yadav
Associate Professor
Department of Computer Science and Engineering,
Harcourt Butler Technical University, Kanpur

To the
School of Engineering
Harcourt Butler Technical University, Kanpur
(Formerly Harcourt Butler Technological Institute, Kanpur)
(May, 2021)

ABSTRACT

The recent lockdown caused by the Covid-19 outbreak witnessed a sudden increase in online transactions. As markets, grocery shops and banks were closed, people relied extensively on online shopping platforms to make purchases and transfer money, and these payments relied heavily on credit cards. With this drastic upsurge in online transactions, the use of credit cards for payment has also increased, which means there is a greater possibility of fraudulent transactions and, eventually, heavy financial losses. Therefore, banks and other financial institutions support the development of credit card fraud detection applications.

Transactions with a high suspicion of fraud can be identified by analyzing the behaviour of credit card customers through their previous transaction history. If a transaction deviates from the patterns available in that history, there is a high chance that it is fraudulent. Machine learning techniques are extensively used to detect these frauds. In this report, we have used the Local Outlier Factor and Isolation Forest algorithms for the removal of outliers, and the KNN and Random Forest techniques for classification, i.e. to detect the frauds. The performance of the models is evaluated based on accuracy, precision and recall.

CERTIFICATE

This is to certify that the report presented here, titled “Credit Card Fraud Detection
using Machine Learning”, submitted in partial fulfillment of the requirements for the award of the
degree of Bachelor of Technology to the Computer Science and Engineering
Department of Harcourt Butler Technical University, Kanpur, is an authentic record of
work carried out under the guidance of Dr. Anita Yadav, Associate Professor,
Computer Science and Engineering Department, HBTU Kanpur.

The matter embodied in this report has not been submitted by the candidates for the award of any
other degree or diploma.

This is to certify that the above statements made by the candidates are correct to the
best of our knowledge.

Date: 27-05-2021 Dr. Anita Yadav


Associate Professor
CSE Department
HBTU Kanpur

ACKNOWLEDGEMENT
We would like to take this opportunity to express our sincere gratitude towards our
dignified mentor Dr. Anita Yadav, Associate Professor, Department of Computer
Science and Engineering, HBTU Kanpur for her guidance, patience and high
dedication with which she was involved in this work. We are grateful for the hours that
went into discussing and finalizing even the minute details of the work despite her busy
schedule.

We are extremely thankful to Mrs. Priyanka Pandey and all our teachers for their
precious time for evaluation of our work and for their constant encouragement,
suggestions and feedback.

Our sincere gratitude to HBTU in general for providing excellent material and labs
with all the facilities required for the development of this project.

Utkarsh Verma (170103048)


Madhav Singh (170108016)
Sumit Kr Chaudhary (170101028)
Faizal Khan (170108011)

Final B.Tech Information Technology

LIST OF FIGURES
Fig. 1.1: Logistic Regression
Fig. 1.2: KNN in feature space
Fig. 1.3: SVM in feature space
Fig. 1.4: Random Forest
Fig. 1.5: Basic architecture of the project
Fig. 2.1: Local outlier factor & isolation forest comparison with other research papers
Fig. 2.2: KNN performance comparison with other research papers
Fig. 2.3: Random forest confusion matrix comparison with other research papers
Fig. 2.4: Random forest performance comparison with other research papers
Fig. 2.5: Random forest performance comparison with other research papers
Fig. 3.1: Basic architecture of the system
Fig. 3.2: Detailed architecture of the system
Fig. 3.3: Plan of work
Fig. 3.4: Local outlier factor algorithm
Fig. 3.5: Isolation forest algorithm
Fig. 4.1: Dataset obtained from Kaggle
Fig. 4.2: Distribution of ‘time’ and ‘amount’ feature
Fig. 4.3: Fraudulent count versus non-fraudulent count
Fig. 4.4: Number of fraud and genuine transactions
Fig. 4.5: Details of both valid and fraud transactions for ‘Amount’ transaction
Fig. 4.6: 2-D scatter plot
Fig. 4.7: 3-D scatter plot
Fig. 4.8: Histogram for ‘time’ and ‘amount’ feature
Fig. 4.9: Histogram showing distribution of time
Fig. 4.10: Histogram showing probability of fraud cases wrt amount
Fig. 4.11: Box plot
Fig. 4.12: Dataset for analyzing outliers
Fig. 4.13: Box plot for analyzing outliers
Fig. 4.14: Box plot for analyzing outliers
Fig. 4.15: Heat map or correlation matrix

Fig. 4.16: Correlation coefficients
Fig. 4.17: Scaled Dataset
Fig. 4.18: Balanced dataset
Fig. 4.19: Count of fraudulent versus non-fraudulent transactions for balanced dataset
Fig. 4.20: Features with negative correlation
Fig. 4.21: Features with positive correlation
Fig. 4.22: Extreme outlier removal for features which are negative correlated
Fig. 4.23: Extreme outlier removal for features which are positive correlated
Fig. 4.24: Graph representing number of optimal features
Fig. 4.25: Dataset with optimal set of features
Fig. 4.26: Correlation matrix for exploratory data analysis
Fig. 4.27: Comparison of performance of local outlier factor and isolation forest
Fig. 4.28: Dataset after removal of outliers
Fig. 4.29: Fraud and valid dataset represented separately
Fig. 4.30: Applying KNN
Fig. 4.31: Confusion matrix and performance of KNN
Fig. 4.32: Confusion matrix and performance of Random Forest
Fig. 5.1: Evaluation of the algorithms used for outliers’ removal (Local Outlier Factor and
Isolation Forest algorithms)
Fig. 5.2: Result of removal of outliers
Fig. 5.3: Evaluation of KNN
Fig. 5.4: Evaluation of Random Forest

LIST OF TABLES
Table 2.1: KNN confusion matrix comparison with other research papers

Table 2.2: Random forest confusion matrix comparison with other research papers

Table 3.1: Dataset obtained from Kaggle

LIST OF ABBREVIATIONS AND SYMBOLS USED
ML – Machine Learning

PCA– Principal Component Analysis

KNN– K-Nearest Neighbour

IEEE– Institute of Electrical and Electronics Engineers

Colab– Colaboratory

CONTENTS
1 Introduction
1.1 Overview
1.2 Problem statement
1.3 Problem solution
1.4 Binary classification problem
1.5 Basic architecture of the project
2 Literature Review
2.1 History
2.2 Related work
3 Work Methodology
3.1 Architecture
3.2 Plan of work
3.3 Tools & Techniques Used
3.4 Algorithms used
4 Design and Development Details
4.1 Visualization and analysis
4.2 Analyzing outliers
4.3 Standard scaling
4.4 Balancing the dataset
4.5 Analyzing correlation coefficients on balanced dataset
4.6 Extreme outlier removal using box plot
4.7 Feature selection
4.8 Exploratory data analysis
4.9 Advanced outlier removal
4.10 Applying KNN
4.11 Applying Random forest
5 Results and Discussions
5.1 Final Product
5.2 Comparison with other similar works
6 Conclusion and Future Work
6.1 Conclusion
6.2 Future Scope of Improvement
7 References

CHAPTER 1

INTRODUCTION

1.1 Overview:
Human civilization has always been based on economic exchange, and as civilizations
grew, we developed newer methods of transaction: first the barter system, then
currency, and eventually online transfers of money through banks by various means.

Nowadays, transactions through credit cards are very common since they are easy and less
time consuming. But a major problem comes with them: fraudulent transactions.
Therefore, the detection of such transactions becomes imperative.

Credit card fraud detection is a problem that has persisted for a long time because it is
strenuous to solve, and there are many issues associated with it. With the restricted amount
of data available, it is difficult to find a pattern in the dataset. A dataset can contain lakhs
of entries with only a handful of fraudulent ones, and these may still fit a pattern of legitimate
behaviour. The problem also has many practical restrictions. Firstly, for security reasons, datasets
are not easily available to the public, or if available, are censored, which makes published
results hard to reproduce and makes it challenging to benchmark the models built.
Secondly, the improvement of these methods is hampered by the fact that security
concerns restrict the exchange of ideas and techniques in credit card fraud
detection. Finally, the data are continuously evolving, so the behaviour of legitimate and
fraudulent transactions keeps changing. With the massive boom in the field of machine
learning, it has been identified as a successful technique for the detection of fraudulent
transactions.

1.2 Problem statement


The main problem is that an online payment does not require the physical presence of the card. Any
person with the card details can commit such a crime (a fraudulent transaction), and the cardholder
comes to know about the fraud only after the transaction has occurred.
The sole aim of this project is to develop a Credit Card Fraud Detection System using
Machine Learning.

1.3 Problem solution

A huge amount of data is transferred during digital transaction processes, and each transaction
results in a binary outcome: genuine or fraudulent. Unfortunately, for confidentiality reasons,
the original features and further background information about the data are not available. Within
the sample training and testing datasets, most features are therefore constructed or normalized:
the dataset contains only numerical input values (even for features whose original values were
strings), which are the result of a PCA transformation of the original attributes, such as the
name and age of the card holder, the transaction amount and the origin of the credit card.
These principal features obtained with PCA are named V1, ..., V28. The features
on which PCA is not applied are ‘Time’ and ‘Amount’. The feature ‘Class’ is the predicted
variable, which takes the value 1 in case of fraud and 0 otherwise.

There are many features, and each feature contributes to a different extent towards the
fraud probability. We have used the KNN and Random Forest ML algorithms on the
dataset to classify incoming transactions as genuine or fraudulent. The spending behaviour
of a cardholder depends on the past history of transactions (features such as
location, daily expenses and transaction time), which can be compared with the
details of the current transaction to detect credit card frauds. Deviation from this
behavioural pattern helps to detect fraudulent transactions accurately. Using this deviation
or shift in behavioural data, we applied different machine learning techniques to detect the
fraud. The resulting model is used to identify whether a new incoming transaction is
fraudulent or genuine.

1.4 Binary classification problem

Machine learning is a field of study and is concerned with algorithms that learn from
examples. Classification is a task that requires the use of machine learning algorithms
that learn how to assign a class label to examples from the problem domain. An easy to

understand example is classifying emails as “spam” or “not spam.” There are many
different types of classification tasks that we may encounter in machine learning. Here,
we are only concerned with binary classification predictive modeling. Binary
classification refers to classification tasks that have exactly two class labels.
Examples include:
a. Email spam detection (spam or not)
b. Churn prediction (churn or not)
c. Conversion prediction (buy or not)
d. Credit card fraud detection (fraud or not)

The most basic and commonly used form of classification is a binary classification.
Here, the dependent variable comprises two exclusive categories that are denoted
through 1 and 0, hence the term Binary Classification. Often 1 means True and 0 means
False. For example, if the business problem is whether the bank member was able to
repay the loan and we have a feature/variable that says “Loan Defaulter,” then the
response will either be 1 (which would mean True, i.e., Loan defaulter) or 0 (which
would mean False, i.e., Non-Loan Defaulter). Binary classification forms the basis of many
classification algorithms and is the most easily understood classification
technique. Some popular algorithms that can be used for binary
classification include:
1.4.1 Logistic Regression
Logistic Regression is used when the dependent variable (target) is categorical. We use an activation
function to classify the data points into either 0 or 1. The activation function, called the sigmoid function, is
shown below.

Figure 1.1

Data instances or points near 1 will be classified as belonging to class 1 and those near 0 will be

classified as belonging to class 0.
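As a minimal illustration of this idea (not part of the project code), the sigmoid function 1 / (1 + e^(-z)) can be computed and thresholded at 0.5 as follows:

import numpy as np

def sigmoid(z):
    # maps any real-valued score into the (0, 1) range
    return 1.0 / (1.0 + np.exp(-z))

scores = np.array([-3.0, -0.5, 0.0, 1.2, 4.0])
probabilities = sigmoid(scores)
predictions = (probabilities >= 0.5).astype(int)   # near 1 -> class 1, near 0 -> class 0
print(probabilities)
print(predictions)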

1.4.2 K-Nearest Neighbours


K-nearest neighbors is a simple algorithm that stores all available cases and classifies new cases based on
a similarity measure (e.g., distance functions). A case is classified by a majority vote of its neighbors, with
the case being assigned to the class most common amongst its K-nearest neighbors measured by a distance
function. If K = 1, then the case is simply assigned to the class of its nearest neighbor. Distance functions
such as Euclidean, Manhattan and Minkowski are often used to calculate these distances.

Figure 1.2

1.4.3 Support Vector Machine (SVM)


The objective of the support vector machine algorithm is to find a hyper-plane in an N-dimensional space (N
— the number of features) that distinctly classifies the data points. To separate the two classes of data
points, there are many possible hyper-planes that could be chosen. Our objective is to find a plane that has
the maximum margin, i.e. the maximum distance between data points of both classes. Maximizing the
margin distance provides some reinforcement so that future data points can be classified with more
confidence. Hyper-planes are decision boundaries that help classify the data points. Data points falling on
either side of the hyper-plane can be attributed to different classes.

Figure 1.3
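A minimal sketch of this idea using scikit-learn's SVC with a linear kernel (the data here are illustrative, not the project dataset):

from sklearn.svm import SVC
import numpy as np

# two small, linearly separable classes
X = np.array([[0, 0], [0, 1], [1, 0], [2, 2], [2, 3], [3, 2]])
y = np.array([0, 0, 0, 1, 1, 1])

svm = SVC(kernel='linear')    # finds the maximum-margin hyper-plane
svm.fit(X, y)
print(svm.support_vectors_)   # the points that define the margin
print(svm.predict([[0.5, 0.5], [2.5, 2.5]]))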

1.4.4 Random Forest
Random forest, like its name implies, consists of a large number of individual decision trees that operate as
an ensemble. Each individual tree in the random forest spits out a class prediction and the class with the
most votes becomes our model’s prediction (see figure below).

Figure 1.4

The fundamental concept behind random forest is a simple but powerful one — the wisdom of crowds.

1.5 Basic architecture of the project

Figure 1.5

CHAPTER 2

LITERATURE REVIEW

2.1 History

The first revolving-credit card with universal merchant acceptance was the BankAmeriCard,
originally issued in 1958 by Bank of America. The card started in
California but grew from there. In 1966, Bank of America expanded its bank card
program by forming the BankAmeriCard Service Corporation, which licensed banks
outside of California and allowed them to issue cards to their customers. By 1969, most
regional banks converted their independent programs to either BankAmeriCard or
Master Charge (now known as VISA and MasterCard, respectively). By 1970, more
than 1,400 banks offered one or the other credit card.

In the early 1970s, when a credit card was used to make a purchase, it was manually
processed through a slide machine, which left an imprint of the credit card number on a
multiple-part receipt. The original copy was for the merchant and the carbon copy was
for the customer. Technological advances led to most credit card sales being handled
electronically via telephone, computer, or the Internet, with the information processed in
a matter of seconds. From the time of manual machines to the modern electronic
processors, credit cards have been used fraudulently.

Numerous agencies have been involved in detecting these frauds through various
methods, including the conventional one, i.e. manual review. Ever since the advent of machine
learning, detecting such fraudulent activities has become far easier. Several studies
have been conducted to determine which machine learning
technique is best suited to classifying transactions as genuine or fraudulent.

2.2 Related Works


Numerous papers focus on detecting fraudulent transactions using deep neural
networks and other advanced concepts. However, these models are computationally
expensive and perform better on larger datasets. They are also beyond the scope of our
current knowledge base. These approaches may lead to great results, as we saw in some
papers, but what if the same results, or even better ones, can be achieved with fewer
resources? Our main goal is to show that simple machine learning
algorithms can give decent results with appropriate preprocessing.

The research papers that we used for our study are:

2.2.1 Credit Card Fraud Detection


This research paper is not a standard research paper. It proposed an idea of how we should proceed with
our project. It provided us with an insight into the problem statement and suggested a model or
architecture as the solution.

2.2.2 Credit Card Fraud Detection using Machine Learning and Data Science
This is a standard research paper that we obtained from the IEEE website. It provided valuable insight
into how to analyze and then visualize the dataset. However, the graphs used in this research paper were
far fewer than ours: they used only three graphs for visualization, so their analysis of the dataset was not
as thorough. Apart from the common 1D, 2D and 3D scatter plots, we have also used histograms and
box-and-whisker plots for proper analysis, and we incorporated statistical properties such as the mean,
variance and standard deviation of the dataset downloaded from Kaggle.
In addition, we performed a complete exploratory data analysis (in-depth analysis and visualization) of
the dataset, for which we plotted five more graphs, leaving no scope for ambiguity or inconsistency.

2.2.3 Fraud Detection in Credit Card using Machine Learning Techniques


This paper did not use statistical properties (such as the mean, median and standard deviation) for its
analysis, nor did it plot graphs such as scatter plots, histograms or box-and-whisker plots for
visualization. Hence, the analysis was not performed well.
However, it did provide an idea of the algorithms to be used. The algorithms used in this paper gave
lower accuracy, precision and recall than ours, so our project performs better than this paper's.

2.2.4 Early Prediction of Credit Card Fraud Detection using Isolation Forest Tree and Local
Outlier Factor Machine Learning Algorithms
This paper provided the idea of data cleaning for the project. However, the algorithms they used were not
as efficient as ours because we used a higher number of neighbours for the Local Outlier Factor and a
higher number of estimators for the Isolation Forest tree, respectively. The comparison of their accuracy
with ours is shown as follows:

(Bar chart comparing the accuracy of this research paper and ours for the Local Outlier Factor and Isolation Forest Tree algorithms; accuracies range roughly from 99.58% to 99.78%.)

Figure 2.1

2.2.5 Credit Card Fraud Detection Using Machine Learning


The KNN classifier in this research paper used a smaller number of neighbours for classification, which
resulted in a poorer confusion matrix than ours. They also did not remove outliers properly from their
dataset. We detected more true positives and true negatives and fewer false positives and false negatives,
which resulted in better accuracy for our model. They used 4000 data instances while we used 5000.
The comparison of the confusion matrices is given below:

Table 2.1

The comparison of accuracies is also given as follows:

(Bar chart comparing KNN accuracy for this research paper and ours; accuracies range roughly from 99.4% to 100%.)

Figure 2.2

2.2.6 Credit Card Fraud Detection using Random Forest
This paper is not from any standard publisher. The confusion matrix shown in this paper is poorer than
ours. This may be because of differences in the implementations (they might have used a lower number
of estimators for Random Forest), and their outlier removal algorithm might not be as efficient and
accurate as ours. A comparison of the confusion matrix of their algorithm and ours is given below:

(Left: this research paper; right: our research paper)


Figure 2.3

A comparison of their respective accuracies is below:

(Bar chart comparing Random Forest accuracy for this research paper and ours; accuracies range roughly from 99.75% to 100%.)

Figure 2.4

2.2.7 Credit Card Fraud Detection Using Machine Learning


This is a standard research paper taken from the IEEE website. We used a different library than they did,
which made our confusion matrix for the Isolation Forest better than theirs. Our outlier removal technique
might also have been better than theirs. A comparison of the confusion matrices is given below:

                Predicted Normal    Predicted Fraud
Actual Normal         4997                 0
Actual Fraud             1                 2

Table 2.2

The comparison of the accuracies is given below:

(Bar chart comparing accuracy for this research paper and ours; accuracies range roughly from 99.935% to 99.985%.)

Figure 2.5

CHAPTER 3

WORK METHODOLOGY

3.1 Architecture

The simplest and most basic architecture can be represented by the following diagram:

Figure 3.1

Taking a more detailed view on a larger scale, incorporating real-life elements, the full
architecture diagram is as follows:

Figure 3.2

3.2 Plan of Work

1. Obtain the required dataset
2. Visualization and analysis
3. Analyzing the presence of outliers
4. Balancing the dataset
5. Extreme outlier removal
6. Feature selection
7. Outlier removal using the Local Outlier Factor and Isolation Forest algorithms
8. Selecting the dataset output by whichever of the above algorithms has higher accuracy
9. Apply KNN and apply Random Forest (in parallel)
10. Evaluating the performance of both algorithms

Figure 3.3

3.3 Tools and technology

3.3.1 Python
Python is an interpreted high-level general-purpose programming language. Python's design philosophy
emphasizes code readability with its notable use of significant indentation. Its language constructs as well
as its object-oriented approach aim to help programmers write clear, logical code for small and large-scale
projects.

The Python language has diverse applications in software development, such as gaming, web frameworks
and applications, language development, prototyping and graphic design applications. This gives the
language an advantage over other programming languages used in the industry.

3.3.2 Google Colaboratory


Colab is a free Jupyter notebook environment that runs entirely in the cloud. Most importantly, it does not
require a setup and the notebooks that you create can be simultaneously edited by your team members -
just the way you edit documents in Google Docs. Colab supports many popular machine learning libraries
which can be easily loaded in your notebook.

3.3.3 Dataset
In this research the Credit Card Fraud Detection dataset was used, which can be downloaded from Kaggle.

The dataset contains 31 numerical features out of which 28 are named as v1-v28 to protect sensitive data
and keep it confidential. Since some of the input variables contain financial information, the PCA
transformation of these input variables was performed in order to keep these data anonymous and
confidential. The rest of the features (three columns) are Time, Amount and Class; these features
were not transformed. The feature "Time" shows the time gap between the first transaction and every other
transaction in the dataset. The feature "Amount" is the amount of the credit card transaction. The feature
"Class" represents the label and takes only two values: 1 in case of a fraud transaction and 0 otherwise.
Dataset contains 284,807 transactions where 492 transactions were frauds and the rest were genuine.
Considering the numbers, we can see that this dataset is highly imbalanced, where only 0.173% of
transactions are labeled as frauds. Since distribution ratio of classes plays an important role in model
accuracy and precision, preprocessing of the data is crucial.

Table 3.1

3.4 Algorithms used


3.4.1 Local Outlier Factor

It is an outlier detection algorithm. The 'Local Outlier Factor' refers to the anomaly score of an instance of the
dataset; it measures how much a sample deviates from its neighbours.
The local aspect is given by the k-nearest neighbours, and a distance function is used to estimate the
closeness of a data point to its neighbours.

Figure 3.4

By comparing the distance values of a data instance to those of its neighbours, one can identify
instances that deviate from their neighbours. Such strongly anomalous instances are called
outliers. Because the dataset is massive, we used only a small part of it in this step to reduce
processing time. The outcome of the removal of outliers on the fully preprocessed dataset is also
reported in the results section of this report.
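A minimal sketch of how scikit-learn's LocalOutlierFactor flags such deviating instances (the data and parameter values here are illustrative, not the ones used in our project):

from sklearn.neighbors import LocalOutlierFactor
import numpy as np

# toy data: four points cluster near the origin, one point lies far away
X = np.array([[0.0, 0.1], [0.2, 0.0], [0.1, 0.2], [0.0, 0.0], [5.0, 5.0]])

lof = LocalOutlierFactor(n_neighbors=2, contamination=0.2)
labels = lof.fit_predict(X)              # -1 marks outliers, 1 marks inliers
print(labels)
print(lof.negative_outlier_factor_)      # more negative means more anomalous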

3.4.2 Isolation Forest

This algorithm 'isolates' data instances by randomly selecting a feature (column) and then randomly
selecting a split value between the maximum and minimum values of that feature.
This random partitioning produces shorter paths in the trees for anomalous values and
distinguishes them from the normal portion of the data. The algorithm recursively generates partitions
of the dataset by repeatedly selecting a random feature and a random split value. The anomalies need
fewer random partitions to be isolated than the normal data points, so the anomalies are the points with
a shorter path in the tree, where path length is the number of edges traversed from the root node.

Figure 3.5

Random partitioning therefore produces shorter paths for anomalies: when the forest of random trees produces
shorter path lengths for specific samples, those samples have a high chance of being
anomalies. Once such anomalies are found, the model can be used to report them to the concerned
authorities.
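A minimal sketch of the same idea with scikit-learn's IsolationForest (again with illustrative data and parameters rather than our project settings):

from sklearn.ensemble import IsolationForest
import numpy as np

X = np.array([[0.0, 0.1], [0.2, 0.0], [0.1, 0.2], [0.0, 0.0], [5.0, 5.0]])

iso = IsolationForest(n_estimators=50, contamination=0.2, random_state=42)
labels = iso.fit_predict(X)          # -1 marks anomalies, 1 marks normal points
scores = iso.decision_function(X)    # lower scores correspond to shorter average paths
print(labels)
print(scores)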

3.4.3 KNN
K-nearest neighbours is a distance-based, supervised machine learning technique. It is not only one of the
simplest but also a highly accurate classification algorithm. A new incoming data instance is classified
based on the majority prediction of its K nearest neighbours. The following three aspects are essential to
the performance of this classifier:

a. The distance function used to locate the nearest neighbors.

b. The method used to find out the category of k-nearest neighbor.

c. The value of ‘k’ (number of neighbours) used for the classification of the new data instances.

Amongst all the competitive credit card fraud detection methods, KNN almost always secures high
performance in evaluation. The best part is that it doesn’t even assume anything about the training dataset.
It only needs a function to calculate the distance between two points.

In KNN, we classify an incoming transaction by finding the nearest neighbour to that transaction; if the
nearest neighbour turns out to be fraudulent, the transaction is assigned to the fraud class.
The value of K is taken as a small, odd number (typically 1, 3 or 5); a larger value of K reduces the
effect of noise in the dataset. Different distance functions can be used to compute the distances:
for continuous or regression problems, Euclidean distance is a good fit; for
classification problems with categorical attributes, a simple matching coefficient is commonly used; and for
multivariate problems, the distance is computed for each attribute and then combined. The distance metric
needs to be optimized for better performance. This technique requires a balanced dataset for training
(an equal proportion of genuine and fraudulent transactions). It is a good technique that can be
followed and trusted.

A rough algorithm for KNN can be given as:

a. Divide the dataset into two for training and testing.

b. Select a value of k.

c. Determine the distance function to be used.

d. Choose a sample from the testing dataset that needs to be classified and compute its distance
to all the training samples.


e. Sort the distances obtained and take the k-nearest data samples.
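A minimal sketch of these steps using scikit-learn's KNeighborsClassifier on illustrative data (not the project dataset):

from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split
import numpy as np

# illustrative two-feature data with a binary label (0 = genuine, 1 = fraud)
rng = np.random.RandomState(42)
X = rng.rand(100, 2)
y = (X[:, 0] + X[:, 1] > 1.2).astype(int)

# step a: split the dataset for training and testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# steps b and c: choose k (small and odd) and a distance function (Euclidean)
knn = KNeighborsClassifier(n_neighbors=3, metric='euclidean')
knn.fit(X_train, y_train)

# steps d and e: distances to the training samples are sorted internally
# and the majority class of the 3 nearest neighbours is assigned
print(knn.predict(X_test))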

3.4.4 Random Forest


This is a supervised machine learning algorithm. Ensemble learning is a form of machine learning where
different algorithms, or the same algorithm applied multiple times, are combined to devise a more accurate
model. Random forest combines multiple decision trees into a group of trees (a forest), which is the reason
for its name. It is better than a single decision tree because it reduces over-fitting by averaging the
results, and it can be used for both regression and classification problems.

A rough algorithm for Random forest can be given as:

a. Pick N random data instances from the dataset.

b. Build a decision tree based on these N records.

c. Choose the number of trees we want in the forest and repeat steps a and b for each tree.

d. For a classification problem, each tree in the forest predicts the class to which the new
incoming data instance belongs.

e. Finally, the new incoming data instance is assigned to the class that wins the majority vote.
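A minimal sketch of these steps with scikit-learn's RandomForestClassifier on illustrative data (not the project dataset):

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
import numpy as np

# illustrative binary-labelled data (0 = genuine, 1 = fraud)
rng = np.random.RandomState(0)
X = rng.rand(200, 4)
y = (X[:, 0] > 0.5).astype(int)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

# n_estimators is the number of trees in the forest (step c)
rf = RandomForestClassifier(n_estimators=100, random_state=0)
rf.fit(X_train, y_train)     # each tree is built on a random sample of the data (steps a and b)
print(rf.predict(X_test))    # each tree votes and the majority class wins (steps d and e)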

CHAPTER 4

DESIGN AND DEVELOPMENT DETAILS

4.1 Visualization and analysis
First of all, we import all the libraries used in the project. The libraries
imported are as follows:

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np

from sklearn.neighbors import LocalOutlierFactor

import sklearn
import scipy

from sklearn.metrics import classification_report,accuracy_score


from sklearn.ensemble import IsolationForest

from sklearn.svm import OneClassSVM


from pylab import rcParams
rcParams['figure.figsize'] = 14, 8
RANDOM_SEED = 42
LABELS = ["Normal", "Fraud"]

from sklearn.metrics import confusion_matrix


from sklearn.preprocessing import StandardScaler

from sklearn.model_selection import cross_validate


from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score

from matplotlib import gridspec

After this, the dataset was uploaded from the local drive to Google Colaboratory and read as follows:

url = "/content/sample_data/creditcard.csv"
creditcard = pd.read_csv(url)
data =creditcard
df = creditcard
creditcard

The output of this code is as shown in the figure that follows:

Figure 4.1

Visualization of the distribution of the ‘Time’ and ‘Amount’ features (the ‘Amount’ plot is the analogous counterpart assumed for Figure 4.2):

plt.figure(figsize=(10,8))
plt.title('Distribution of Time Feature')
sns.distplot(df.Time)

plt.figure(figsize=(10,8))
plt.title('Distribution of Amount Feature')
sns.distplot(df.Amount)

Figure 4.2

Plotting Fraudulent count versus Non-fraudulent count:


counts = df.Class.value_counts()  # assumed definition: class counts (0 = non-fraudulent, 1 = fraudulent)
plt.figure(figsize=(8,6))
sns.barplot(x=counts.index, y=counts)
plt.title('Count of Fraudulent vs. Non-Fraudulent Transactions')
plt.ylabel('Count')
plt.xlabel('Class (0:Non-Fraudulent, 1:Fraudulent)')

Figure 4.3

Determining number of fraud cases and valid transactions:

fraud = data[data['Class'] == 1]
valid = data[data['Class'] == 0]
outlierFraction = len(fraud)/float(len(valid))
print(outlierFraction)
print('Fraud Cases: {}'.format(len(data[data['Class'] == 1])))
print('Valid Transactions: {}'.format(len(data[data['Class'] == 0])))

Figure 4.4
Details of both valid and fraud transactions for ‘Amount’ transaction:
print('details of valid transaction')
valid.Amount.describe()

print('Amount_details_of_the_fraudulent_transaction')
fraud.Amount.describe()

Figure 4.5

Plotting 2-D Scatter plot for both the classes:
sns.set_style("whitegrid")
sns.FacetGrid(creditcard, hue="Class", height = 6).map(plt.scatter, "Time", "Amount").add_legend()
plt.show()

Figure 4.6

A 3-D scatter plot (drawn with seaborn's pairplot) was produced as follows:

# FilteredData is assumed to be the reduced dataframe containing the 'Time', 'Amount' and 'Class' columns
plt.close()
sns.set_style("whitegrid")
sns.pairplot(FilteredData, hue="Class", height=5)
plt.show()

Figure 4.7

As is clear from the above plots, frauds occur only for transactions whose amount
is less than 2500. As for time, the fraud transactions are evenly distributed
throughout. Nearly all transactions are very small in amount, and only a limited few are
close to the maximum transaction amount.

Then, we moved on to plot the histograms for our dataset. They are shown below:
creditCard_genuine = FilteredData.loc[FilteredData["Class"] == 0]
creditCard_fraud = FilteredData.loc[FilteredData["Class"] == 1]

plt.plot(creditCard_genuine["Time"], np.zeros_like(creditCard_genuine["Time"]), "o")


plt.plot(creditCard_fraud["Time"], np.zeros_like(creditCard_fraud["Time"]), "o")

plt.show()

plt.plot(creditCard_genuine["Amount"], np.zeros_like(creditCard_genuine["Amount"]), "o")


plt.plot(creditCard_fraud["Amount"], np.zeros_like(creditCard_fraud["Amount"]), "o")

plt.show()

(Left: Time; right: Amount)
Figure 4.8

sns.FacetGrid(FilteredData, hue="Class", height=10).map(sns.distplot, "Time").add_legend()


plt.show()

Figure 4.9

These histograms show that there is a very heavy overlap of genuine and fraud
transactions throughout time, with no clear distinction between them. They also reveal that
most transactions have an amount below 2500, and that all of the fraud
transactions fall below this amount; there is no fraud transaction with an
amount greater than 2500.

Now, we plotted probability distribution of being fraud versus amount graph as follows:
counts, bin_edges = np.histogram(FilteredData['Amount'], bins=10, density = True)
pdf = counts/(sum(counts))

print("pdf = ",pdf)
print("\n")
print("Counts =",counts)
print("\n")
print("Bin edges = ",bin_edges)

cdf = np.cumsum(pdf)
plt.plot(bin_edges[1:],pdf)
plt.plot(bin_edges[1:], cdf)

plt.show();

Figure 4.10

Now, we analyzed various statistics of the ‘Amount’ feature of the dataset as follows:
print("Means:")
print("Mean of transaction amount of genuine transactions: ",np.mean(creditCard_genuine["Amount"]))
print("Mean of transaction amount of fraud transactions: ",np.mean(creditCard_fraud["Amount"]))

print("Standard Deviation:")
print("Std-Deviation of transaction amount of genuine transactions: ", np.std(creditCard_genuine["Amount"]))
print("Std-Deviation of transaction amount of fraud transactions: ", np.std(creditCard_fraud["Amount"]))

print("Median:")
print("Median of transaction amount of genuine transactions: ", np.median(creditCard_genuine["Amount"]))
print("Median of transaction amount of fraud transactions: ", np.median(creditCard_fraud["Amount"]))

Next, we moved on to plot the box plots,


sns.boxplot(x = "Class", y = "Time", data = creditcard)
plt.show()

sns.boxplot(x = "Class", y = "Amount", data = creditcard)


plt.ylim(0, 5000)
plt.show()

Figure 4.11

By looking at the first box plot, we can say that both fraud and genuine transactions
occur throughout the time range and there is no distinction between them. From the second
box plot, we can easily infer that no fraud transactions occur above a
transaction amount of 3000: all of the fraud transactions have amounts below
3000, whereas there are many genuine transactions with amounts greater than 3000.

4.2 Analyzing outliers

From the dataset we have, we take out 50,000 data instances for the purpose of
analyzing the presence of outliers.
data = pd.read_csv(url)

data_50000 = data.sample(n=50000)
data_50000

Figure 4.12

We apply the Local Outlier Factor algorithm to analyze the presence of outliers as follows:
lof = LocalOutlierFactor(n_neighbors=5, algorithm='auto', metric='minkowski', p=2,
                         metric_params=None, contamination=0.5, n_jobs=1)

# FinalData is assumed to be the feature dataframe built from the 50,000-row sample
outlierArray = lof.fit_predict(FinalData)

outlierArray

As is clear, the output of the algorithm is an array in which outliers are marked as -1:
len(outlierArray)

Next, we calculate the total number of outliers and inliers from the above array:
countOutliers = 0
countInliers = 0
for i in range(50000):
    if outlierArray[i] == -1:
        countOutliers += 1
    else:
        countInliers += 1
print("Total number of outliers = " + str(countOutliers))
print("Total number of inliers = " + str(countInliers))

FinalData2 = FinalData.copy()

FinalData2.shape

Removing the outliers and plotting box plots for the V1 and V5 features:
for i in range(50000):
    if outlierArray[i] == -1:
        FinalData.drop(i, inplace=True)
FinalData.head()

FinalData.shape

fig = plt.figure(figsize = (16,6))

plt.subplot(1, 2, 1)
plt.title("Before removing outliers for column V1")
ax = sns.boxplot(x="Class", y = "V1", data= FinalData2, hue = "Class")

plt.subplot(1, 2, 2)
plt.title("After removing outliers for column V1")
ax = sns.boxplot(x="Class", y = "V1", data= FinalData, hue = "Class")

Figure 4.13

fig = plt.figure(figsize = (16,6))

plt.subplot(1, 2, 1)
plt.title("Before removing outliers for column V5")
ax = sns.boxplot(x="Class", y = "V5", data= FinalData2, hue = "Class")

plt.subplot(1, 2, 2)
plt.title("After removing outliers for column V5")
ax = sns.boxplot(x="Class", y = "V5", data= FinalData, hue = "Class")

Figure 4.14

Now, we plot a heatmap to show the correlation between various features of the dataset,
corr = df.corr()
plt.figure(figsize=(6,6))
heat = sns.heatmap(data=corr)
plt.title('Heatmap of Correlation')

Figure 4.15

We also examined the skewness of each feature as follows:
skew_ = df.skew()
skew_

Figure 4.16

This shows that the ‘Time’ feature has the least importance in our dataset, so we can remove
it from our dataset.

4.3 Standard scaling


Now, we move on to scale ‘amount’ and ‘time’ features:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
scaler2 = StandardScaler()

#scaling time
scaled_time = scaler.fit_transform(df[['Time']])
flat_list1 = [item for sublist in scaled_time.tolist() for item in sublist]
scaled_time = pd.Series(flat_list1)

#scaling the amount column


scaled_amount = scaler2.fit_transform(df[['Amount']])
flat_list2 = [item for sublist in scaled_amount.tolist() for item in sublist]
scaled_amount = pd.Series(flat_list2)

#concatenating newly created columns with original 50,000 dataset


df = pd.concat([df, scaled_amount.rename('scaled_amount'), scaled_time.rename('scaled_time')], axis=1)
df

Figure 4.17

4.4 Balancing the dataset

The dataset that we have is heavily imbalanced, so we try to balance it as follows:

#how many random samples from normal transactions do we need?
# ('train' is assumed to be the training split of the dataframe created earlier)
no_of_frauds = train.Class.value_counts()[1]
print('There are {} fraudulent transactions in the train data.'.format(no_of_frauds))

non_fraud = train[train['Class'] == 0]
fraud = train[train['Class'] == 1]
print(fraud)

selected = non_fraud.sample(no_of_frauds)
selected

#concatenating both into a subsample data set with equal class distribution
selected.reset_index(drop=True, inplace=True)
fraud.reset_index(drop=True, inplace=True)
subsample = pd.concat([selected, fraud])
len(subsample)

Figure 4.18

Now, we plot a graph to confirm that the subsample dataset is balanced:
new_counts = subsample.Class.value_counts()
plt.figure(figsize=(8,6))
sns.barplot(x=new_counts.index, y=new_counts)
plt.title('Count of Fraudulent vs. Non-Fraudulent Transactions In Subsample')
plt.ylabel('Count')
plt.xlabel('Class (0:Non-Fraudulent, 1:Fraudulent)')

Figure 4.19

4.5 Analyzing correlation coefficients on balanced dataset


Now, we go on to study the positive and negative correlations:

#negative correlations smaller than -0.5


corr[corr.Class < -0.5]

#visualizing the features w high negative correlation


f, axes = plt.subplots(nrows=2, ncols=4, figsize=(15,15))

f.suptitle('Features With High Negative Correlation', size=35)


sns.boxplot(x="Class", y="V3", data=subsample, ax=axes[0,0])
sns.boxplot(x="Class", y="V9", data=subsample, ax=axes[0,1])
sns.boxplot(x="Class", y="V10", data=subsample, ax=axes[0,2])
sns.boxplot(x="Class", y="V12", data=subsample, ax=axes[0,3])
sns.boxplot(x="Class", y="V14", data=subsample, ax=axes[1,0])
sns.boxplot(x="Class", y="V16", data=subsample, ax=axes[1,1])
sns.boxplot(x="Class", y="V17", data=subsample, ax=axes[1,2])
sns.boxplot(x="Class", y="V18", data=subsample, ax=axes[1,3])

Figure 4.20

#positive correlations greater than 0.5


corr[corr.Class > 0.5]

#visualizing the features w high positive correlation


f, axes = plt.subplots(nrows=2, ncols=2, figsize=(18,9))

f.suptitle('Features With High Positive Correlation', size=20)


sns.boxplot(x="Class", y="V2", data=subsample, ax=axes[0,0])
sns.boxplot(x="Class", y="V4", data=subsample, ax=axes[0,1])
sns.boxplot(x="Class", y="V11", data=subsample, ax=axes[1,0])
f.delaxes(axes[1,1])

Figure 4.21

4.6 Extreme outlier removal using box plot
Removing the outliers through box plot method:
#Only removing extreme outliers
Q1 = subsample.quantile(0.25)
Q3 = subsample.quantile(0.75)
IQR = Q3 - Q1

df2 = subsample[~((subsample < (Q1 - 2.5 * IQR)) |(subsample > (Q3 + 2.5 * IQR))).any(axis=1)]
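As a small worked example of this rule (with made-up numbers, not values from our dataset): if a feature has Q1 = 2 and Q3 = 6, then IQR = 4, and any value below Q1 − 2.5·IQR = −8 or above Q3 + 2.5·IQR = 16 would be treated as an extreme outlier and its row dropped.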

The highly correlated features are shown again through box plots after the extreme outlier removal, as given below:

# AFTER OUTLER REMOVAL


f, axes = plt.subplots(nrows=3, ncols=4, figsize=(26,16))

# Negative Correlation
sns.boxplot(x="Class", y="V1", data=df2, ax=axes[0,0])
sns.boxplot(x="Class", y="V3", data=df2, ax=axes[0,1])
sns.boxplot(x="Class", y="V5", data=df2, ax=axes[0,2])
sns.boxplot(x="Class", y="V6", data=df2, ax=axes[0,3])
sns.boxplot(x="Class", y="V7", data=df2, ax=axes[1,0])
sns.boxplot(x="Class", y="V9", data=df2, ax=axes[1,1])
sns.boxplot(x="Class", y="V10", data=df2, ax=axes[1,2])
sns.boxplot(x="Class", y="V12", data=df2, ax=axes[1,3])
sns.boxplot(x="Class", y="V14", data=df2, ax=axes[2,0])
sns.boxplot(x="Class", y="V16", data=df2, ax=axes[2,1])
sns.boxplot(x="Class", y="V17", data=df2, ax=axes[2,2])
sns.boxplot(x="Class", y="V18", data=df2, ax=axes[2,3])

Figure 4.22

f, axes = plt.subplots(nrows=2, ncols=2, figsize=(18,9))

# Positive Correlation
sns.boxplot(x="Class", y="V2", data=df2, ax=axes[0,0])
sns.boxplot(x="Class", y="V4", data=df2, ax=axes[0,1])
sns.boxplot(x="Class", y="V11", data=df2, ax=axes[1,0])
f.delaxes(axes[1,1])

Figure 4.23

4.7 Feature selection

Now, we perform feature selection technique to select optimal set of features as follows:
from sklearn.model_selection import StratifiedKFold
from sklearn.feature_selection import RFECV
from sklearn import discriminant_analysis

# Feature Selection using Linear Discriminant Analysis
lda = discriminant_analysis.LinearDiscriminantAnalysis()  # alternatively SVC(kernel="linear")
# The "accuracy" scoring is proportional to the number of correct classifications
rfecv = RFECV(estimator=lda, step=1, cv=StratifiedKFold(3), scoring='accuracy')
# X_data and y_target are assumed to be the feature matrix and 'Class' labels of the balanced dataset
rfecv.fit(X_data, y_target)

print("Optimal number of features : %d" % rfecv.n_features_)

# Plot number of features VS. cross-validation scores


plt.figure(figsize=(5,5))
plt.xlabel("Number of features selected")
plt.ylabel("Cross validation score (nb of correct classifications)")
plt.plot(range(1, len(rfecv.grid_scores_) + 1), rfecv.grid_scores_)
plt.show()

Figure 4.24

To find and extract the features that are optimal, we do the following
rfecv.grid_scores_
rfecv.support_
rfecv.ranking_
X_data.columns.values[rfecv.support_]
fdata=df2[['scaled_time','V3', 'V4', 'V7', 'V9', 'V10', 'V11', 'V12', 'V14', 'V16', 'V17',
'V18','scaled_amount','Class']]
fdata

Figure 4.25

4.8 Exploratory data analysis

We perform a final exploratory analysis on the dataset obtained so far


## Correlation
import seaborn as sns
# data1 is assumed to be the dataset obtained after the preceding feature selection step
#get correlations of each feature in the dataset
corrmat = data1.corr()
top_corr_features = corrmat.index
plt.figure(figsize=(10,10))
#plot heat map
g=sns.heatmap(data[top_corr_features].corr(),annot=True,cmap="RdYlGn")

Figure 4.26

#Create independent and Dependent Features


columns = data1.columns.tolist()
# Filter the columns to remove data we do not want
columns = [c for c in columns if c not in ["Class"]]
# Store the variable we are predicting
target = "Class"
# Define a random state
state = np.random.RandomState(42)
X = data1[columns]
Y = data1[target]
X_outliers = state.uniform(low=0, high=1, size=(X.shape[0], X.shape[1]))
# Print the shapes of X & Y
print(X.shape)
print(Y.shape)

4.9 Advanced outlier removal

We apply Local Outlier Factor and Isolation Forest algorithms and then compare them:

#Define the outlier detection methods
# outlier_fraction is assumed to be the fraction of fraud cases computed earlier
classifiers = {
    "Isolation Forest": IsolationForest(n_estimators=100, max_samples=len(X),
                                        contamination=outlier_fraction,
                                        random_state=state, verbose=0),
    "Local Outlier Factor": LocalOutlierFactor(n_neighbors=20, algorithm='auto',
                                               leaf_size=30, metric='minkowski',
                                               p=2, metric_params=None,
                                               contamination=outlier_fraction)
}

#Comparing the accuracies of the two algorithms

# Fraud is assumed to be the subset of rows with Class == 1
n_outliers = len(Fraud)
for i, (clf_name, clf) in enumerate(classifiers.items()):
    #Fit the data and tag outliers
    if clf_name == "Local Outlier Factor":
        y_pred = clf.fit_predict(X)
        scores_prediction = clf.negative_outlier_factor_
    else:
        clf.fit(X)
        scores_prediction = clf.decision_function(X)
        y_pred = clf.predict(X)
    #Reshape the prediction values to 0 for Valid transactions, 1 for Fraud transactions
    y_pred[y_pred == 1] = 0
    y_pred[y_pred == -1] = 1
    n_errors = (y_pred != Y).sum()
    # Run Classification Metrics
    print("{}: {}".format(clf_name, n_errors))
    print("Accuracy Score :")
    print(accuracy_score(Y, y_pred))
    print("Classification Report :")
    print(classification_report(Y, y_pred))

Figure 4.27

Removing the outliers:
y_pred

# dd is assumed to be a copy of the full dataframe, aligned with y_pred
for i in range(284807):
    if y_pred[i] == 1:
        dd.drop(i, inplace=True)

cldata = dd

Figure 4.28

Printing fraud and valid datasets separately after getting the outliers removed,
Fraud = dd[dd['Class']==1]

Valid = dd[dd['Class']==0]
print(Valid)
print(Fraud)

Figure 4.29

4.10 Applying KNN


After having obtained the most optimal feature set, we first prepare our training and
testing dataset and then we finally apply KNN algorithm as follows:

data = cldata
FinalData=data[:25000]

#taking first 25000 samples and the optimal feature set


data_25000 = FinalData
d1=data_25000
d2=d1[['Time','V3', 'V4', 'V7', 'V9', 'V10', 'V11', 'V12', 'V14', 'V16', 'V17',
'V18','Amount','Class']]
d2

from sklearn.preprocessing import StandardScaler


data2=d2
scaler = StandardScaler()
data2['Amount'] = scaler.fit_transform(data2['Amount'].values.reshape(-1, 1))
data2.drop('Time', axis=1, inplace=True)

feature_columns = data2.columns.values.tolist()
feature_columns.remove('Class')
target = 'Class'
data2
d2_Std=data2
d2_labels=data2["Class"]

#taking last 5k points as test data and first 20k points as train data
X1 = d2_Std[0:20000]
XTest = d2_Std[20000:25000]
Y1 = d2_labels[0:20000]
YTest = d2_labels[20000:25000]

# finding best value of 'k'


myList = list(range(0, 50))
neighbors = list(filter(lambda x: x % 2 != 0, myList))  # this gives a list of odd numbers only, ranging from 0 to 50

CV_Scores = []

for k in neighbors:
    KNN = KNeighborsClassifier(n_neighbors=k, algorithm='kd_tree')
    scores = cross_val_score(KNN, X1, Y1, cv=5, scoring='recall')
    CV_Scores.append(scores.mean())
CV_Scores

# plotting graph to find best value of k


plt.figure(figsize = (8,8))
plt.plot(neighbors, CV_Scores)
plt.title("Neighbors Vs Recall Score", fontsize=25)
plt.xlabel("Number of Neighbors", fontsize=25)
plt.ylabel("Recall Score", fontsize=25)
plt.grid(linestyle='-', linewidth=0.5)

# finding the best value of k


best_k = neighbors[CV_Scores.index(max(CV_Scores))]
best_k

Figure 4.30

Drawing the confusion matrix for KNN and finding its accuracy,
from sklearn.metrics import recall_score
KNN_best = KNeighborsClassifier(n_neighbors = best_k, algorithm = 'kd_tree')
KNN_best.fit(X1, Y1)
prediction = KNN_best.predict(XTest)

# printing the confusion matrix


LABELS = ['Normal', 'Fraud']
conf_matrix = confusion_matrix(YTest, prediction)
plt.figure(figsize =(7,7))
sns.heatmap(conf_matrix, xticklabels = LABELS,
yticklabels = LABELS, annot = True, fmt ="d");
plt.title("Confusion matrix")
plt.ylabel('True class')
plt.xlabel('Predicted class')
plt.show()

# evaluating the classifier


from sklearn.metrics import classification_report, accuracy_score
from sklearn.metrics import precision_score, recall_score
from sklearn.metrics import f1_score, matthews_corrcoef
from sklearn.metrics import confusion_matrix
n_outliers = len(fraud)
n_errors = (prediction != YTest).sum()
print("The model used is KNN classifier")

acc = accuracy_score(YTest, prediction)


print("The accuracy is {}".format(acc))

prec = precision_score(YTest, prediction)


print("The precision is {}".format(prec))

rec = recall_score(YTest, prediction)


print("The recall is {}".format(rec))

f1 = f1_score(YTest, prediction)
print("The F1-Score is {}".format(f1))

MCC = matthews_corrcoef(YTest, prediction)


print("The Matthews correlation coefficient is{}".format(MCC))

Figure 4.31

4.11 Applying Random forest

After having obtained the most optimal feature set, we first prepare our training and
testing dataset and then we finally apply Random forest algorithm as follows:

#taking last 5k points as test data and first 20k points as train data
X1 = d2_Std[0:20000]
XTest = d2_Std[20000:25000]
Y1 = d2_labels[0:20000]
YTest = d2_labels[20000:25000]

# Building the Random Forest Classifier (RANDOM FOREST)


from sklearn.ensemble import RandomForestClassifier
# random forest model creation
rfc = RandomForestClassifier()
rfc.fit(X1, Y1)
# predictions
yPred = rfc.predict(XTest)

# printing the confusion matrix


LABELS = ['Normal', 'Fraud']
conf_matrix = confusion_matrix(YTest, yPred)
plt.figure(figsize =(7,7))
sns.heatmap(conf_matrix, xticklabels = LABELS,
yticklabels = LABELS, annot = True, fmt ="d");
plt.title("Confusion matrix")
plt.ylabel('True class')
plt.xlabel('Predicted class')
plt.show()

# Evaluating the classifier


from sklearn.metrics import classification_report, accuracy_score
from sklearn.metrics import precision_score, recall_score
from sklearn.metrics import f1_score, matthews_corrcoef
from sklearn.metrics import confusion_matrix

n_outliers = len(fraud)
n_errors = (yPred != YTest).sum()
print("The model used is Random Forest classifier")

acc = accuracy_score(YTest, yPred)


print("The accuracy is {}".format(acc))

prec = precision_score(YTest, yPred)


print("The precision is {}".format(prec))

rec = recall_score(YTest, yPred)


print("The recall is {}".format(rec))

f1 = f1_score(YTest, yPred)
print("The F1-Score is {}".format(f1))

MCC = matthews_corrcoef(YTest, yPred)


print("The Matthews correlation coefficient is{}".format(MCC))

Figure 4.32

CHAPTER 5

RESULTS AND DISCUSSIONS

5.1 Final product
5.1.1 Evaluation of the algorithms used for outliers’ removal (Local Outlier Factor and Isolation
Forest algorithms)

Figure 5.1

As is clear, Isolation Forest algorithm gives better results for our model.

5.1.2 Removal of outliers

Figure 5.2

5.1.3 Evaluation of KNN algorithm

Figure 5.3

The above confusion matrix is for the KNN algorithm. As we can see, the value of “True Negative” is
4996, which means that out of the 4997 points belonging to class 0, 4996 points are correctly predicted as
0. Furthermore, from the same confusion matrix, it can be seen that the “True Positive” count is 2, which means
that out of the 3 points belonging to class 1, 2 points are correctly classified as 1.

The final result of KNN algorithm is as follows:

• accuracy: 99.960%

• precision: 66.667%

• recall: 66.667%
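These values follow directly from the confusion matrix above: with TP = 2, FP = 1, FN = 1 and TN = 4996, accuracy = (4996 + 2) / 5000 = 99.96%, precision = TP / (TP + FP) = 2 / 3 ≈ 66.67% and recall = TP / (TP + FN) = 2 / 3 ≈ 66.67%.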

5.1.4 Evaluation of Random Forest algorithm

Figure 5.4

The above confusion matrix is for Random Forest.


The final result of Random Forest algorithm is as follows:

• accuracy: 99.980%

• precision: 100.000%

• recall: 66.667%
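These values are again consistent with the Random Forest confusion matrix: with TP = 2, FP = 0, FN = 1 and TN = 4997, accuracy = (4997 + 2) / 5000 = 99.98%, precision = 2 / (2 + 0) = 100% and recall = 2 / (2 + 1) ≈ 66.67%.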

5.2 Comparison with other similar works

Numerous studies have already been conducted on this problem, and a plethora of papers is
available online. The research papers that we used for our study are:

5.2.1 Credit Card Fraud Detection


This research paper is not a standard research paper. It proposed an idea of how we should proceed with
our project. It provided us with an insight into the problem statement and suggested a model or
architecture as the solution.

5.2.2 Credit Card Fraud Detection using Machine Learning and Data Science
This is a standard research paper that we obtained from the IEEE website. It provided valuable insight
into how to analyze and then visualize the dataset. However, the graphs used in this research paper were
far fewer than ours: they used only three graphs for visualization, so their analysis of the dataset was not
as thorough. Apart from the common 1D, 2D and 3D scatter plots, we have also used histograms and
box-and-whisker plots for proper analysis, and we incorporated statistical properties such as the mean,
variance and standard deviation of the dataset downloaded from Kaggle.
In addition, we performed a complete exploratory data analysis (in-depth analysis and visualization) of
the dataset, for which we plotted five more graphs, leaving no scope for ambiguity or inconsistency.

5.2.3 Fraud Detection in Credit Card using Machine Learning Techniques


This paper did not use statistical properties (such as the mean, median and standard deviation) for its
analysis, nor did it plot graphs such as scatter plots, histograms or box-and-whisker plots for
visualization. Hence, the analysis was not performed well.
However, it did provide an idea of the algorithms to be used. The algorithms used in this paper gave
lower accuracy, precision and recall than ours, so our project performs better than this paper's.

5.2.4 Early Prediction of Credit Card Fraud Detection using Isolation Forest Tree and Local
Outlier Factor Machine Learning Algorithms
This paper provided the idea of data cleaning for the project. However, the algorithms they used were not
as efficient as ours because we used a higher number of neighbours for the Local Outlier Factor and a
higher number of estimators for the Isolation Forest tree, respectively. The comparison of their accuracy
with ours is shown as follows:

(Bar chart comparing the accuracy of this research paper and ours for the Local Outlier Factor and Isolation Forest Tree algorithms; accuracies range roughly from 99.58% to 99.78%.)

5.2.5 Credit Card Fraud Detection Using Machine Learning


The KNN classifier in this research paper used a smaller number of neighbours for classification, which
resulted in a poorer confusion matrix than ours. They also did not remove outliers properly from their
dataset. We detected more true positives and true negatives and fewer false positives and false negatives,
which resulted in better accuracy for our model. They used 4000 data instances while we used 5000.
The comparison of the confusion matrices is given below:

The comparison of accuracies is also given as follows:

(Bar chart comparing KNN accuracy for this research paper and ours; accuracies range roughly from 99.4% to 100%.)

5.2.6 Credit Card Fraud Detection using Random Forest
This paper is not from any standard publisher. The confusion matrix shown in this paper is poorer than
ours. This may be because of differences in the implementations (they might have used a lower number
of estimators for Random Forest), and their outlier removal algorithm might not be as efficient and
accurate as ours. A comparison of the confusion matrix of their algorithm and ours is given below:

(Left: this research paper; right: our research paper)


A comparison of their respective accuracies is below:

(Bar chart comparing Random Forest accuracy for this research paper and ours; accuracies range roughly from 99.75% to 100%.)

5.2.7 Credit Card Fraud Detection Using Machine Learning


This is a standard research paper taken from the IEEE website. We used a different library than they did,
which made our confusion matrix for the Isolation Forest better than theirs. Our outlier removal technique
might also have been better than theirs. A comparison of the confusion matrices is given below:

                Predicted Normal    Predicted Fraud
Actual Normal         4997                 0
Actual Fraud             1                 2

The comparison of the accuracies is given below:

(Bar chart comparing accuracy for this research paper and ours; accuracies range roughly from 99.935% to 99.985%.)

CHAPTER 6

CONCLUSION AND FUTURE WORK

6.1 Conclusion

As we saw above, the Isolation Forest algorithm gave better results than the Local Outlier
Factor algorithm. This means that the Isolation Forest algorithm detected more outliers for
the same number of dataset instances. These results were also considerably better than those
reported in other research papers because of the higher number of neighbours and the higher
number of estimators used by us.
As far as the learning algorithms are concerned, we provided both the KNN and Random
Forest algorithms with a total of 25,000 data instances. Of these, 20,000 instances
were used to train the models while the remaining 5,000 instances were used for
testing.
For KNN, out of the 5,000 testing data instances, 4,997 points belong to class 0 and
3 points belong to class 1. The confusion matrix clearly shows that our model has
performed well despite the very imbalanced dataset. The accuracy given by KNN is
99.960%.
The Random Forest confusion matrix shows that it has been even more successful in its
predictions; the accuracy given by Random Forest is 99.980%. Thus, Random Forest's
performance is better than KNN's.
Random Forest therefore gives more accurate predictions and also requires less time for
both the training and testing phases.
KNN would give better results with a larger amount of training data, but then the time
taken to run on the testing dataset would increase. The use of more complex preprocessing
techniques on the dataset would also help.

6.2 Future scope of improvement

While one hundred percent accuracy in fraud detection could not be achieved, we
successfully built a model that can, with enough time and data, get closer to that
aim. There is always scope for improvement, and that is true for our project as well.

The project allows multiple algorithms to be integrated together, with their outcomes
combined to increase the final accuracy.

This model can be improved further with the use of more modern algorithms. However,
the output given by these algorithms needs to be in the same format as that given by the
others. Once this condition is fulfilled, the algorithms are simple to integrate, which gives
the project a great degree of versatility.
The dataset can also be further improved. As already noted, the accuracy of the algorithms
increases when the dataset is large. Hence, more data instances in the dataset will
definitely make the model more accurate in correctly detecting frauds and in
reducing the number of false positives, thereby increasing accuracy. However, this
requires support from the banks and other institutions.

CHAPTER 7
REFERENCES

[1] Munira Ansari, Hashim Malik, Siddhesh Jadhav, Zaiyyan Khan, “Credit Card Fraud
Detection”, International Journal of Engineering Research & Technology (IJERT),
NREST - 2021 Conference Proceedings.
[2] Mr. S P Maniraj, Aditya Saini, Shadab Ahmed, “Credit Card Fraud Detection using
Machine Learning and Data Science”, Sixth International Conference on Intelligent
Systems Design and Engineering Applications (2019), IEEE.
[3] Mr. Manohar, Arvind Bedi, Shashank Kumar, “Fraud Detection in Credit Card using
Machine Learning Techniques”, International Research Journal of Engineering and
Technology (IRJET), Volume: 07 Issue: 04 | Apr 2020.
[4] Mr. Arjun K P, Subhash Singh Negi, “Early Prediction of Credit Card Fraud
Detection using Isolation Forest Tree and Local Outlier Factor Machine Learning
Algorithms”.
[5] Rahul Powar, Rohan Dawkhar, Pratichi, “Credit Card Fraud Detection using
Machine Learning”, International Journal of Advance Scientific Research and
Engineering Trends, Vol. 5, Iss. 9, September 2020, Springer.
[6] Mrs. Indira, Devi Meenakshi, Gayathri, “Credit Card Fraud Detection using Random
Forest”, International Research Journal of Engineering and Technology (IRJET),
Volume: 06 Issue: 03 | Mar 2019.
[7] Ruttala Sailusha, R. Ramesh, V. Gnaneswar, “Credit Card Fraud Detection Using
Machine Learning”, Proceedings of the International Conference on Intelligent
Computing and Control Systems (ICICCS 2020), IEEE.

