
G H RAISONI COLLEGE OF ENGINEERING AND MANAGEMENT
(An Autonomous Institute affiliated to SPPU)

AY 2020-21
MINI PROJECT – Machine Learning
Credit Cards Fraud Detection System
Third Year VI Semester
GROUP: 15

Group Members:
1. Aman Sarda TCOA04
2. Gaurav Chouksey TCOA19
3. Amey Ghorpade TCOA26
4. Shubham Kachore TCOA32
Guide: Asst. Prof. Sandeep Gore

ML Mini Project- Credit Cards Fraud Detection System


Table of Contents

• Abstract
• Introduction
• Literature Review
• Proposed Technique
• Performance Metrics and Experimental results
• Scope of future application
• Conclusions
• References



Abstract
Due to rapid advancement in electronic commerce technology,
the use of credit cards has increased dramatically. Since the credit card
is the most popular mode of payment, the number of fraud cases
associated with it is also rising.

It is vital that credit card companies are able to identify fraudulent
credit card transactions so that customers are not charged for items
that they did not purchase. Such problems can be tackled with data
science, whose importance, along with that of machine learning,
cannot be overstated.

This project intends to illustrate the modelling of a data set using
machine learning for credit card fraud detection. The credit card
fraud detection problem involves modelling past credit card
transactions together with the knowledge of which ones turned out to
be fraudulent. This model is then used to recognize whether a new
transaction is fraudulent or not.

Our objective here is to detect 100% of the fraudulent transactions
while minimizing the incorrect fraud classifications. Credit card
fraud detection is a typical example of a classification problem.
In this process, we have focused on analysing and pre-processing data
sets as well as the deployment of multiple anomaly detection
algorithms such as Local Outlier Factor and Isolation Forest
on the PCA-transformed credit card transaction data.

Keywords: Credit card fraud, applications of machine
learning, data science, isolation forest algorithm, local outlier
factor, automated fraud detection.

Introduction
‘Fraud’ in credit card transactions is unauthorized and unwanted
usage of an account by someone other than the owner of that account.
Necessary prevention measures can be
taken to stop this abuse and the behaviour of such fraudulent
practices can be studied to minimize it and protect against
similar occurrences in the future. In other words, Credit Card Fraud
can be defined as a case where a person uses someone else’s credit
card for personal reasons while the owner and the card issuing
authorities are unaware of the fact that the card is
being used.
Fraud detection involves monitoring the activities of
populations of users in order to estimate, perceive or avoid
objectionable behaviour, which consists of fraud, intrusion, and
defaulting.
This is a very relevant problem that demands the attention of
communities such as machine learning and data science where
the solution to this problem can be automated.
This problem is particularly challenging from the perspective
of learning, as it is characterized by various factors such as
class imbalance. The number of valid transactions far
outnumbers the fraudulent ones. Also, the transaction patterns
often change their statistical properties over the course of
time. These are not the only challenges in the implementation of a
real-world fraud detection system, however. In real world
examples, the massive stream of payment requests is quickly
scanned by automatic tools that determine which transactions
to authorize.
Machine learning algorithms are employed to analyse all the
authorized transactions and report the suspicious ones.
These reports are investigated by professionals who contact the
cardholders to confirm if the transaction was genuine or fraudulent.
The investigators provide feedback to the automated system, which
is used to train and update the algorithm to eventually improve the
fraud-detection performance over time.



Fraud detection methods are continuously developed to defend
against criminals who keep adapting their fraudulent strategies. These
frauds are classified as:
• Credit Card Frauds: Online and Offline
• Card Theft
• Account Bankruptcy
• Device Intrusion
• Application Fraud
• Counterfeit Card
• Telecommunication Fraud

Some of the currently used approaches to the detection of such
frauds are:
• Artificial Neural Network
• Fuzzy Logic
• Genetic Algorithm
• Logistic Regression
• Decision tree
• Support Vector Machines
• Bayesian Networks
• Hidden Markov Model
• K-Nearest Neighbour

Literature Review
1. The Uncertain Case of Credit Card Fraud Detection:
Uncertainty is inherent in many real-time event-driven applications.
Credit card fraud detection is a typical uncertain domain, where
potential fraud incidents must be detected in real time and tagged
before the transaction has been accepted or denied. We present
extensions to the IBM Proactive Technology Online (PROTON) open
source tool to cope with uncertainty. The inclusion of uncertainty
aspects impacts all levels of the architecture and logic of an event
processing engine. The extensions implemented in PROTON include
the addition of new built-in attributes and functions, support for new
types of operands, and support for event processing patterns to cope
with all these. The new capabilities were implemented as building
blocks and basic primitives in the complex event processing
programmatic language. This enables implementation of event-driven
applications possessing uncertainty aspects from different domains in
a generic manner. A first application was devised in the domain of
credit card fraud detection. Our preliminary results are encouraging,
showing potential benefits that stem from incorporating uncertainty
aspects to the domain of credit card fraud detection [1]. (Authors:
Fabiana Fournier, Ivo Correia, Inna Skarbovsky)

2. A Comparative Analysis of Various Credit Card Fraud Detection
Techniques:
Fraud is any malicious activity that aims to cause financial loss to the
other party. As the use of digital money or plastic money even in
developing countries is on the rise so is the fraud associated with
them. Credit card frauds have cost consumers and banks
billions of dollars globally. Even after numerous mechanisms to stop
fraud, fraudsters are continuously trying to find new ways and tricks
to commit fraud. Thus, in order to stop these frauds, we need a
powerful fraud detection system which not only detects the fraud but
also detects it before it takes place and in an accurate manner. We
need to also make our systems learn from the past committed frauds
and make them capable of adapting to future new methods of frauds.
In this paper we have introduced the concept of frauds related to
credit cards and their various types. We have explained various
techniques available for a fraud detection system such as Support
Vector Machine (SVM), Artificial Neural Networks (ANN), Bayesian
Network, K-Nearest Neighbor (KNN), Hidden Markov Model, Fuzzy
Logic Based System and Decision Trees. An extensive review is done
on the existing and proposed models for credit card fraud detection,
and a comparative study of these techniques is carried out on the
basis of quantitative measurements such as accuracy, detection rate
and false alarm rate. The conclusion of our study explains the
drawbacks of existing models and provides a better solution in order
to overcome them [2]. (Authors: Yashvi Jain, Namrata Tiwari,
Shripriya Dubey, Sarika Jain)

3. Credit Card Fraud Detection System-A Survey:


The credit card has become the most popular mode of payment for
both online and regular purchases, and cases of fraud associated
with it are also rising. Credit card frauds are increasing day by day
regardless of the various techniques developed for their detection.
Fraudsters are so expert that they generate new ways of committing
fraudulent transactions each day, which demands constant innovation
in detection techniques. Most of the techniques are based on Artificial
Intelligence, fuzzy logic, neural networks, logistic regression, naïve
Bayesian, machine learning, sequence alignment, decision trees,
Bayesian networks, meta learning, genetic programming etc.; these
have evolved to detect various fraudulent credit card transactions.
This paper presents a survey of various techniques used in credit card
fraud detection mechanisms [3]. (Authors: Dinesh L. Talekar, K. P.
Adhiya)

Proposed Technique



This paper uses the proposed techniques to detect fraud in the
credit card system. Different machine learning algorithms, namely
Logistic Regression, Decision Trees and Random Forest, are compared
to determine which algorithm suits best and can be adopted by credit
card merchants for identifying fraudulent transactions.
Figure 1 shows the architectural diagram representing the
overall system framework. The processing steps used to find the best
algorithm for the given dataset are listed in Table 1.

Table 1: Processing steps

Algorithm steps:
Step 1: Read the dataset.
Step 2: Random sampling is done on the dataset to make it
balanced.
Step 3: Divide the dataset into two parts, i.e., train dataset and
test dataset.
Step 4: Feature selection is applied for the proposed models.
Step 5: Accuracy and performance metrics are calculated to
know the efficiency of the different algorithms.
Step 6: Retrieve the best algorithm based on efficiency for the
given dataset.



Figure1: System Architecture

1. Logistic Regression:
Logistic regression is a classification algorithm used to predict a
binary outcome (1 / 0, Yes / No, True / False) from a given set of
independent variables. Dummy variables are used to represent binary
or categorical values. Logistic regression can be viewed as a special
case of linear regression in which the outcome variable is categorical
and the log of odds is used as the dependent variable. It predicts the
probability of occurrence of an event by fitting the data to a logistic
function, such as:
O = e^(I0 + I1*x) / (1 + e^(I0 + I1*x))
Where,
O is the predicted output
I0 is the bias or intercept term
I1 is the coefficient for the single input value (x).

Each column in the input data has an associated I coefficient (a
constant real value) that must be learned from the training data.

The same function is often written with y as the predicted output and
b0, b1 as the intercept and coefficient:
y = e^(b0 + b1*x) / (1 + e^(b0 + b1*x))
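
The logistic function above can be sketched in a few lines of Python. This is only an illustration: the function name `logistic` and the coefficient values are assumptions chosen for the example, not values learned in this project.

```python
import math

def logistic(x, i0, i1):
    """Return O = e^(I0 + I1*x) / (1 + e^(I0 + I1*x)) for a single input x."""
    z = i0 + i1 * x
    # Use the algebraically equivalent form 1 / (1 + e^-z) when z >= 0
    # so that exp() never receives a large positive argument.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

# Hypothetical coefficients, for illustration only.
I0, I1 = -1.0, 2.0
probabilities = [logistic(x, I0, I1) for x in (-2.0, 0.5, 3.0)]
```

Whatever the input, the output stays between 0 and 1, which is what allows it to be read as a probability of fraud.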

Logistic regression starts from the simple linear regression
equation, with the dependent variable enclosed in a link function:
A(O) = β0 + β(x)
Where
A(): link function
O : outcome variable
x : independent variable

The function is established using two quantities:
1) Probability of success (pr)
2) Probability of failure (1 - pr)

pr should meet the following criteria: a) the probability must always
be non-negative (pr >= 0); b) the probability must always be less than
or equal to 1 (pr <= 1). For the first criterion, the linear term is
passed through the exponential function, whose value is always
greater than 0:
pr = exp(β0 + β(x)) = e^(β0 + β(x))

For the second criterion, the same exponential is divided by itself
plus 1, so that the value is always less than or equal to 1:
pr = e^(β0 + β(x)) / (e^(β0 + β(x)) + 1)

The logistic function is used in logistic regression, and a cost
function quantifies the error when the model's response is compared
with the true value:
X(θ) = −(1/m) * Σ_{i=1..m} [ y_i * log(hθ(x_i)) + (1 − y_i) * log(1 − hθ(x_i)) ]
Where
hθ(x_i) : logistic function
y_i : outcome variable
m : number of training examples
Gradient descent is a learning algorithm that can be used to minimize
this cost function.
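
As a rough sketch of how the cost function X(θ) and gradient descent fit together, the following pure-Python example trains a one-feature logistic model. The toy data, the learning rate, the number of steps and all function names are assumptions made for illustration; they are not the settings used in this project.

```python
import math

def h(theta0, theta1, x):
    """Logistic function h_theta(x) for a single feature x."""
    return 1.0 / (1.0 + math.exp(-(theta0 + theta1 * x)))

def cost(theta0, theta1, xs, ys):
    """X(theta) = -(1/m) * sum(y*log(h) + (1 - y)*log(1 - h))."""
    m = len(xs)
    total = 0.0
    for x, y in zip(xs, ys):
        p = h(theta0, theta1, x)
        total += y * math.log(p) + (1 - y) * math.log(1 - p)
    return -total / m

def gradient_descent(xs, ys, learning_rate=0.1, steps=500):
    """Repeatedly move theta against the gradient of the cost."""
    theta0, theta1 = 0.0, 0.0
    m = len(xs)
    for _ in range(steps):
        errors = [h(theta0, theta1, x) - y for x, y in zip(xs, ys)]
        grad0 = sum(errors) / m
        grad1 = sum(e * x for e, x in zip(errors, xs)) / m
        theta0 -= learning_rate * grad0
        theta1 -= learning_rate * grad1
    return theta0, theta1

# Hypothetical toy data: one feature, label 1 marks the positive class.
xs = [0.5, 1.0, 1.5, 3.0, 3.5, 4.0]
ys = [0, 0, 0, 1, 1, 1]
initial_cost = cost(0.0, 0.0, xs, ys)
theta0, theta1 = gradient_descent(xs, ys)
final_cost = cost(theta0, theta1, xs, ys)
```

Comparing `initial_cost` with `final_cost` shows the cost falling as the coefficients are learned.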



2. Decision Tree Algorithm:
Decision tree is a type of supervised learning algorithm (having a pre-
defined target variable) that is mostly used in classification problems.
It works for both categorical and continuous input and output
variables.
In this technique, we split the population or sample into two or more
homogeneous sets (or sub-populations) based on the most significant
splitter / differentiator among the input variables.

TYPES OF DECISION TREE

1. Categorical Variable Decision Tree: A decision tree that has a
categorical target variable is called a categorical variable
decision tree.
2. Continuous Variable Decision Tree: A decision tree that has a
continuous target variable is called a continuous variable
decision tree.

TERMINOLOGY OF DECISION TREE:

1. Root Node: It represents the entire population or sample, and this
further gets divided into two or more homogeneous sets.
2. Splitting: It is the process of dividing a node into two or more
sub-nodes.
3. Decision Node: When a sub-node splits into further sub-nodes,
it is called a decision node.
4. Leaf / Terminal Node: A node that does not split is called a leaf
or terminal node.
5. Pruning: When we remove the sub-nodes of a decision node, this
process is called pruning. It is the opposite of splitting.
6. Branch / Sub-Tree: A sub-section of the entire tree is called a
branch or sub-tree.
7. Parent and Child Node: A node which is divided into sub-nodes
is called the parent node of those sub-nodes, whereas the sub-nodes
are the children of the parent node.



WORKING OF DECISION TREE:
Decision trees use multiple algorithms to decide how to split a node
into two or more sub-nodes. The creation of sub-nodes increases the
homogeneity of the resultant sub-nodes. In other words, the purity of
the node increases with respect to the target variable.
The decision tree splits the nodes on all available variables and then
selects the split which results in the most homogeneous sub-nodes.
Commonly used splitting criteria are:
1. Gini Index
2. Information Gain
3. Chi Square
4. Reduction of Variance
3. Random Forest:
Random forest is a tree-based algorithm which involves building
several trees and combining their outputs to improve the
generalization ability of the model. This method of combining trees is
known as an ensemble method.
Ensembling combines weak learners (individual trees) to produce a
strong learner. Random forest can be used to solve
regression and classification problems. In regression problems, the
dependent variable is continuous. In classification problems, the
dependent variable is categorical.



WORKING OF RANDOM FOREST:
1. A bagging algorithm is used to create random samples.
2. Given a data set D1 with n rows and m columns, a new data set
D2 is created by sampling n cases at random with replacement
from the original data.
3. About 1/3rd of the rows of D1 are left out of D2; these are
known as Out of Bag samples.
4. The model is then trained on the new dataset D2, and the Out of
Bag samples are used to obtain an unbiased estimate of the error.
Out of the m columns, M << m columns are selected at each node.
5. The M columns are selected at random. Usually, the default
choice of M is m/3 for a regression tree and sqrt(m) for a
classification tree.
6. Unlike a single decision tree, no pruning takes place in a random
forest, i.e., each tree is grown fully. In decision trees, pruning is
a method to avoid overfitting; it means selecting the sub-tree that
leads to the lowest test error rate.
7. Cross validation is used to determine the test error rate of a
sub-tree. Several trees are grown and the final prediction is
obtained by averaging or voting.
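
The sampling side of these steps can be sketched as follows. This is not a full random forest: it only illustrates bootstrap sampling, the Out of Bag rows, the default number of columns per split and majority voting. The row count, the column count of 30, the seed and the function names are all assumptions for the example.

```python
import random
from collections import Counter

def bootstrap_sample(n_rows, rng):
    """Draw n_rows row indices with replacement (the bagging step)."""
    in_bag = [rng.randrange(n_rows) for _ in range(n_rows)]
    # Rows never drawn form the Out of Bag set used for error estimation.
    out_of_bag = sorted(set(range(n_rows)) - set(in_bag))
    return in_bag, out_of_bag

def columns_per_split(m, task):
    """Default M: sqrt(m) for classification, m/3 for regression."""
    if task == "classification":
        return max(1, int(m ** 0.5))
    return max(1, m // 3)

def majority_vote(predictions):
    """Combine the class predictions of the individual trees."""
    return Counter(predictions).most_common(1)[0][0]

rng = random.Random(42)  # fixed seed so the sketch is repeatable
in_bag, out_of_bag = bootstrap_sample(1000, rng)
oob_fraction = len(out_of_bag) / 1000  # expected to be roughly one third
# Hypothetical dataset with m = 30 columns: pick the candidates for one node.
candidate_columns = rng.sample(range(30), columns_per_split(30, "classification"))
```

Each tree would be trained on its own `in_bag` rows and scored on its `out_of_bag` rows, with `majority_vote` combining the per-tree predictions for classification.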

Table 2: Algorithm steps for finding the Best algorithm


Step 1: Import the dataset.
Step 2: Convert the data into data frame format.
Step 3: Do random oversampling using the ROSE package.
Step 4: Decide the amount of data for training and testing.
Step 5: Give 70% of the data for training and the remaining data for testing.
Step 6: Assign the train dataset to the models.
Step 7: Choose the algorithm among the 3 different algorithms and create the
model.
Step 8: Make predictions on the test dataset for each algorithm.
Step 9: Calculate the accuracy for each algorithm.
Step 10: Apply the confusion matrix for each variable.
Step 11: Compare the algorithms for all the variables and find the best
algorithm.
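
Steps 3 to 5 can be illustrated with a small standard-library sketch. ROSE is an R package; the code below is not ROSE but a plain random-oversampling stand-in written in Python, and the toy rows, the seed and the function names are assumptions for illustration.

```python
import random

def random_oversample(rows, label_of, rng):
    """Duplicate minority-class rows at random until all classes have equal counts."""
    by_class = {}
    for row in rows:
        by_class.setdefault(label_of(row), []).append(row)
    target = max(len(group) for group in by_class.values())
    balanced = []
    for group in by_class.values():
        balanced.extend(group)
        balanced.extend(rng.choice(group) for _ in range(target - len(group)))
    return balanced

def train_test_split(rows, train_fraction, rng):
    """Shuffle a copy of the rows and cut it into train and test parts."""
    shuffled = rows[:]
    rng.shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

rng = random.Random(0)  # fixed seed so the sketch is repeatable
# Hypothetical toy data: (amount, label) pairs with 95 valid and 5 fraud rows.
rows = [(float(i), 0) for i in range(95)] + [(1000.0 + i, 1) for i in range(5)]
balanced = random_oversample(rows, lambda row: row[1], rng)
train, test = train_test_split(balanced, 0.7, rng)
```

The order follows Table 2. Because rows are duplicated before the split, copies of the same fraud row can land in both the train and test sets; oversampling only the training portion is a common alternative.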



PERFORMANCE METRICS AND EXPERIMENTAL RESULTS:
1. Performance metrics:
The basic performance measures are derived from the confusion matrix.
The confusion matrix is a 2 by 2 table containing the four outcomes
produced by a binary classifier. Various measures such as
sensitivity, specificity, accuracy and error rate are derived from the
confusion matrix.

Accuracy:
Accuracy is calculated as the total number of correct
predictions (A+B) divided by the total size of the dataset (C+D).
It can also be calculated as (1 - error rate).

Accuracy = (A+B)/(C+D)
Where
A = True Positive
B = True Negative
C = Positive (total actual positives)
D = Negative (total actual negatives)

Error rate:
Error rate is calculated as the total number of incorrect
predictions (E+F) divided by the total size of the dataset (C+D).

Error rate = (E+F)/(C+D)
Where
E = False Positive
F = False Negative
C = Positive (total actual positives)
D = Negative (total actual negatives)

Sensitivity:
Sensitivity is calculated as the number of correct positive
predictions(A) divided by the total number of positives(C).
Sensitivity=A/C

Specificity:
Specificity is calculated as the number of correct negative
predictions (B) divided by the total number of negatives (D).
Specificity = B/D
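
The four measures defined above can be computed directly from the confusion-matrix counts. The sketch below follows those definitions; the function name and the counts passed to it are hypothetical and are not results from this project.

```python
def confusion_metrics(tp, tn, fp, fn):
    """Compute the four measures from confusion-matrix counts.

    tp = A (true positive), tn = B (true negative),
    fp = E (false positive), fn = F (false negative).
    """
    positives = tp + fn  # C: total actual positives
    negatives = tn + fp  # D: total actual negatives
    total = positives + negatives
    return {
        "accuracy": (tp + tn) / total,
        "error_rate": (fp + fn) / total,
        "sensitivity": tp / positives,
        "specificity": tn / negatives,
    }

# Hypothetical counts for illustration only.
metrics = confusion_metrics(tp=80, tn=900, fp=15, fn=5)
```

With these definitions, accuracy and error rate always sum to 1, which matches the (1 - error rate) relation stated for accuracy.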

Accuracy, error rate, sensitivity and specificity are used to report the
performance of the system in detecting credit card fraud.
In this paper, three machine learning algorithms are developed to
detect fraud in the credit card system.
To evaluate the algorithms, 70% of the dataset is used for training and
30% is used for testing and validation.
Accuracy, error rate, sensitivity and specificity are evaluated for
different numbers of variables for the three algorithms, as shown in
Table 3.
The accuracy results for the logistic regression, decision tree and
random forest classifiers are 92.7, 95.8, and 97.6 respectively.
The comparative results show that the random forest performs better
than the logistic regression and decision tree techniques.
Table 3: Performance analysis for three different algorithms

Feature Selection    Logistic regression    Decision tree    Random Forest
For 5 variables      87.2                   89               90.1
For 10 variables     88.6                   92.1             93.6
For all variables    90.0                   94.3             95.5

Future Scope
While we couldn’t reach our goal of 100% accuracy in fraud
detection, we did end up creating a system that can, with enough time
and data, get very close to that goal.
As with any such project, there is some room for improvement here.
The very nature of this project allows for multiple algorithms to be
integrated together as modules and their results can be combined to
increase the accuracy of the final result.
This model can further be improved with the addition of more
algorithms into it. However, the output of these algorithms needs to
be in the same format as the others. Once that condition is satisfied,
the modules are easy to add as done in the code. This provides a great
degree of modularity and versatility to the project.
More room for improvement can be found in the dataset. As
demonstrated before, the precision of the algorithms increases when
the size of dataset is increased. Hence, more data will surely make the
model more accurate in detecting frauds and reduce the number of
false positives. However, this requires official support from the banks
themselves.

Conclusion



Credit card fraud is without a doubt an act of criminal dishonesty.
This synopsis has listed out the most common methods of fraud along
with their detection methods and reviewed recent findings in this
field. It has also explained in detail how machine learning can be
applied to get better results in fraud detection, along with the
algorithm, pseudocode, an explanation of its implementation and
experimentation results.
While the algorithm does reach over 99.6% accuracy, its precision
remains only at 28% when a tenth of the data set is taken into
consideration. However, when the entire dataset is fed into the
algorithm, the precision rises to 33%. This high percentage of
accuracy is to be expected due to the huge imbalance between the
number of valid and number of fraudulent transactions. Since the entire
dataset consists of limited transaction records, it’s only a fraction of
data that can be made available if this project were to be used on a
commercial scale. Being based on machine learning algorithms, the
program will only increase its efficiency over time as more data is put
into it.



References
1) Suman, “Survey Paper on Credit Card Fraud Detection”,
Research Scholar, GJUS&T Hisar HCE, Sonepat, International
Journal of Advanced Research in Computer Engineering &
Technology (IJARCET), Volume 3, Issue 3, March 2014.

2) “Credit Card Fraud Detection: A Realistic Modeling and a
Novel Learning Strategy”, IEEE Transactions on Neural
Networks and Learning Systems, Vol. 29, No. 8, August 2018.

3) David J. Weston, David J. Hand, M. Adams, Whitrow and Piotr
Juszczak, “Plastic Card Fraud Detection using Peer Group
Analysis”, Springer, 2008.

4) Yashvi Jain, Namrata Tiwari, Shripriya Dubey, Sarika Jain,
“A Comparative Analysis of Various Credit Card Fraud Detection
Techniques”, Blue Eyes Intelligence Engineering and Sciences
Publications, 2019.

5) A. Shen, R. Tong, Y. Deng, “Application of classification
models on credit card fraud detection”, Service Systems and
Service Management 2007 International Conference, pp. 1-4,
2007.
