A PROJECT REPORT
Submitted by
DEEPIKA MOHANTY (1701348146)
MONALISHA MISHRA (1701348176)
RITIK BISWAL (1701348180)
SAGAR MALIK (1701348183)
SANDEEP KUMAR DEY (1701348187)
BACHELOR OF TECHNOLOGY
IN
COMPUTER SCIENCE AND ENGINEERING
BIJU PATNAIK UNIVERSITY OF TECHNOLOGY
BONAFIDE CERTIFICATE
Certified that this project report “CREDIT CARD FRAUD DETECTION” is the bonafide
work of “Deepika Mohanty, Monalisha Mishra, Ritik Biswal, Sagar Malik, Sandeep
Kumar Dey”, who carried out the project work under my supervision.
Prof. Nilamadhab Mishra        Prof. Chandan Kumar Panda
CERTIFICATE OF EVALUATION
The project report submitted by the above students in partial fulfilment of the award of the Bachelor
of Technology in Computer Science and Engineering, Biju Patnaik University of Technology, has been
evaluated and confirmed to be a report of the work done by them.
TABLE OF CONTENTS
1. INTRODUCTION
LIST OF FIGURES
ACKNOWLEDGEMENT
I take this opportunity to express a deep sense of gratitude to Er. Rama Narayan Sabat,
Vice-Chairman, and Prof. Satya Prakash Das, Vice-Principal, GIET, Ghangapatana, Bhubaneswar, for
their cordial support, valuable information and guidance, which helped me in completing this task
through its various stages.
I wish to express my profound and sincere gratitude to Prof. Chandan Kumar Panda, Project
Coordinator, Department of Computer Science & Engineering, who guided me through the intricacies of
this project with matchless magnanimity.
I also extend my sincere appreciation to the faculty members who provided valuable suggestions and
precious time in accomplishing my major project report.
Lastly, I would like to thank the Almighty and my parents for their moral support, and my friends
with whom I shared my day-to-day experiences and received many suggestions that improved the
quality of my work.
ABSTRACT
Yawning is caused by tiredness and fatigue, and driver fatigue may lead to accidents. These
can be prevented by getting enough sleep before driving, drinking coffee or an energy
drink, or resting when the signs of drowsiness occur. Popular drowsiness detection
methods rely on complex measurements such as EEG and ECG. These methods achieve high
accuracy, but they require contact measurement and have many limitations for monitoring
driver fatigue and drowsiness, so they are not comfortable to use in real-time
driving. This paper proposes a way to detect signs of drowsiness among drivers by
measuring the eye closing rate and yawning.
This project describes how to detect the eyes and mouth in a video recorded during an
experiment conducted by MIROS (Malaysian Institute of Road Safety). In the video, a
participant drives a driving simulation system while a webcam placed in front of
the driving simulator records the transition from awake to fatigued and, finally,
drowsy. The designed system detects the face area in each image captured from the
video. Restricting processing to the face area narrows the search for the eyes and
mouth. Once the face is found, the left eye, right eye and mouth are located by
separate detectors within the face region.
The eye and mouth detection parameters are defined within the face image. The
video is converted into image frames, from which the eyes and mouth can be located.
Once the eyes are located, measuring the intensity changes in the eye area determines
whether the eyes are open or closed.
If the eyes are found closed for 4 consecutive frames, the driver is confirmed to be in
a drowsy condition.
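The 4-consecutive-frame rule above can be sketched as a simple counter. The per-frame closed/open flags are assumed to come from the eye-intensity check described earlier; the data below is dummy input for illustration:

```python
def detect_drowsiness(eye_closed_frames, threshold=4):
    """Return True once the eyes stay closed for `threshold` consecutive frames."""
    consecutive = 0
    for closed in eye_closed_frames:
        if closed:
            consecutive += 1
            if consecutive >= threshold:
                return True
        else:
            consecutive = 0   # any open frame resets the counter
    return False

# A short blink does not trigger the alarm; a sustained closure does.
print(detect_drowsiness([True, True, False, True, True]))         # False
print(detect_drowsiness([False, True, True, True, True, False]))  # True
```

Resetting the counter on every open frame is what distinguishes a normal blink from the sustained closure that signals drowsiness.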
INTRODUCTION
Drowsiness is a state of near sleep, where the person has a strong desire for sleep. It has
two distinct meanings, referring both to the usual state preceding falling asleep and the
chronic condition referring to being in that state independent of a daily rhythm [16].
Sleepiness can be dangerous when performing tasks that require constant concentration,
such as driving a vehicle. When a person is sufficiently fatigued while driving, they will
experience drowsiness, which increases the risk of a road accident.
Figure 1 shows the road accident statistics in Malaysia from 2005 to 2009
provided by MIROS (Malaysian Institute of Road Safety). The number of vehicles involved in
road accidents keeps increasing each year. From Figure 1, cars and taxis account for
nearly 400,000 recorded cases of road accidents. The figure keeps increasing every
year, and by 2009 the number of road accidents recorded by MIROS was nearly 500,000.
Figure 2: Examples of Fatigue & Drowsiness Condition which causes yawning
The aim of this project is to develop a simulation of a drowsiness detection system. The
focus is placed on designing a system that accurately monitors the open or closed
state of the driver’s eyes and mouth. By monitoring the eyes, it is believed that the
symptoms of driver drowsiness can be detected at a sufficiently early stage to avoid a car
accident. Yawning detection is a method of assessing the driver’s fatigue: when people are
fatigued, they keep yawning to ensure there is enough oxygen for the brain
before entering the drowsy state [17]. Detection of fatigue and drowsiness
involves a sequence of face images and the observation of how long the eyes and mouth
stay open or closed. Another method to detect eye closure is PERCLOS, which
is based on the percentage of a specific time interval during which the eyes are closed.
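A minimal sketch of the PERCLOS idea, assuming a list of per-frame closed/open flags over a fixed time window (all values illustrative):

```python
def perclos(eye_closed_frames):
    """Percentage of frames (a proxy for time) in which the eyes were closed."""
    if not eye_closed_frames:
        return 0.0
    return 100.0 * sum(eye_closed_frames) / len(eye_closed_frames)

# 3 closed frames out of a 10-frame window -> 30% eye closure
print(perclos([1, 1, 0, 0, 0, 1, 0, 0, 0, 0]))   # 30.0
```

In practice the window length and the alarm threshold (often around 80% closure) are tuning parameters; the figures here are assumptions for illustration.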
The analysis of face images is a popular research area, with applications such as face
recognition and human identification and tracking for security systems. This project
focuses on the localization of the eyes and mouth, which involves examining the entire
image of the face and determining the position of the eyes and mouth by applying
existing image processing algorithms. Once the position of the eyes is located,
the system determines whether the eyes and mouth are open or closed,
and detects fatigue and drowsiness.
WHAT ARE FRAUDULENT TRANSACTIONS?
Fraudulent transactions are orders and purchases made using a credit card or bank
account that does not belong to the buyer. The purpose may be to obtain goods without
paying, or to obtain unauthorized funds from an account.
As one of the largest factors in identity fraud, these transactions can end up
damaging both merchants and the identity fraud victim.
Fraud detection involves monitoring the behaviour of users in order to estimate, detect, or
avoid undesirable behaviour. To counter credit card fraud effectively, it is necessary to
understand the technologies involved in detecting credit card fraud and to identify its
various types.
A credit card is a small plastic card issued to users as a system of payment. It allows its
cardholder to buy goods and services based on the cardholder's promise to pay for them.
Credit card security relies on the physical security of the plastic card as
well as the privacy of the credit card number. Globalization and the increased use of the
internet for online shopping have resulted in a considerable proliferation of credit card
transactions throughout the world. This rapid growth in the number of credit card
transactions has led to a substantial rise in fraudulent activities.
Credit card fraud is a wide-ranging term for theft and fraud committed using a credit card as
a fraudulent source of funds in a given transaction. Credit card fraudsters employ a large
number of techniques to commit fraud. To combat credit card fraud effectively, it is
important to first understand the mechanisms of identifying it. Over the years, credit
card fraud has stabilized largely thanks to various credit card fraud detection and
prevention mechanisms.
In recent years, data mining has attracted considerable attention for building credit card
fraud detection models. Although our problem is approached as a classification problem,
classical data mining algorithms are not directly applicable, so an alternative approach
is taken using general-purpose metaheuristics such as genetic algorithms.
This project proposes a credit card fraud detection system using a genetic algorithm.
Genetic algorithms are evolutionary algorithms that aim to obtain better solutions as
time progresses. When a card is copied, stolen, or lost and captured by fraudsters, it is
usually used until its available limit is depleted. Thus, rather than the number of correctly
classified transactions, a solution that minimizes the total available limit on cards subject
to fraud is more appropriate. The system aims to minimize false alerts using a genetic
algorithm in which a set of interval-valued parameters is optimized.
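As a sketch of this idea, the following toy genetic algorithm evolves a single fraud-score threshold so as to minimize a cost in which a missed fraud loses the card's available limit while a false alert carries only a small fixed cost. The dataset, cost weights and operators are all invented for illustration; the report's actual system optimizes a set of interval-valued parameters rather than one threshold.

```python
import random

random.seed(1)

# Toy data: (risk_score, card_limit, is_fraud). Fraudulent transactions tend to
# score higher so a threshold is worth learning; everything here is invented.
transactions = []
for _ in range(200):
    fraud = random.random() < 0.1
    score = random.uniform(0.5, 1.0) if fraud else random.uniform(0.0, 0.7)
    transactions.append((score, random.randint(500, 5000), fraud))

def cost(threshold):
    """Variable misclassification cost: a missed fraud costs the card's
    available limit, a false alert only a small fixed handling cost."""
    total = 0.0
    for score, limit, fraud in transactions:
        flagged = score > threshold
        if fraud and not flagged:
            total += limit      # missed fraud: lose up to the available limit
        elif flagged and not fraud:
            total += 50         # false alert
    return total

def genetic_search(pop_size=20, generations=40):
    """Evolve a population of candidate thresholds toward minimum cost."""
    pop = [random.random() for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=cost)
        parents = pop[:pop_size // 2]              # selection: keep fittest half
        children = []
        while len(parents) + len(children) < pop_size:
            a, b = random.sample(parents, 2)
            child = (a + b) / 2                    # crossover: blend two parents
            child += random.gauss(0, 0.05)         # mutation: small nudge
            children.append(min(1.0, max(0.0, child)))
        pop = parents + children
    return min(pop, key=cost)

best = genetic_search()
print("best threshold:", round(best, 3), "cost:", cost(best))
```

Because the objective charges a missed fraud its full available limit, the search is pushed toward thresholds that catch high-limit fraud even at the price of some false alerts, which is exactly the trade-off described above.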
PROBLEM DEFINITION AND FEASIBILITY ANALYSIS
The objective is to develop a credit card fraud detection system using a genetic
algorithm. During a credit card transaction, fraud is detected and the number of false alerts
is minimized using the genetic algorithm. Instead of maximizing the number of correctly
classified transactions, we define an objective function in which the misclassification costs
are variable; thus, the correct classification of some transactions is more important than
that of others.
Fraud detection has usually been treated as a data mining problem where the objective is
to correctly classify transactions as legitimate or fraudulent. For classification problems,
many performance measures are defined, most of which relate to the number of cases
classified correctly.
A more appropriate measure is needed due to the inherent structure of credit card
transactions. When a card is copied, stolen, or lost and captured by fraudsters, it is usually
used until its available limit is depleted. Thus, rather than the number of correctly classified
transactions, a solution that minimizes the total available limit on cards subject to fraud is
more appropriate.
Since the fraud detection problem has mostly been defined as a classification problem, many
data mining algorithms, in addition to some statistical approaches, have been proposed to
solve it. Among these, decision trees and artificial neural networks are the most popular.
The study by Bolton and Hand provides a good summary of the literature on fraud
detection problems.
A feasibility analysis is an important tool for assessing the viability of starting a new
value-added business, or of re-organizing or expanding an existing one.
All projects are feasible given unlimited resources and infinite time; unfortunately,
scarcity of resources and tight delivery dates plague all projects.
The following three kinds of feasibility are studied in the feasibility analysis of the project:
Operational feasibility.
Technical feasibility.
Economical feasibility.
The operational scope of the system is verified under operational feasibility. The
proposed system has sufficient operational reach and ensures the security of the
information; hence, its operational feasibility is found to be high.
This project uses the familiar, user-friendly Windows environment. A graphical user
interface, being today's de facto standard, has been exploited to give the user a pleasant
look and feel. Operational feasibility ensures that the project can be successfully
implemented, and the project can be used by anyone with basic internet knowledge. Hence we
conclude that this project is operationally feasible.
If the technical features available in the existing system can accommodate the proposed
system, then the developed system is said to be technically feasible.
As all the technology required for this project is available in the latest browsers, the
project is technically feasible.
2.2.3. ECONOMICAL FEASIBILITY
Economic analysis is the most frequently used method for evaluating the
effectiveness of a new system; more commonly it is known as cost/benefit
analysis. The software used in this project is freeware, so the cost of
developing the tool is minimal. It requires only simple techniques and minimal
software, so it does not incur much cost and can be used in any environment.
SOFTWARE REQUIREMENTS SPECIFICATION
3.1. INTRODUCTION.
3.2.1. PURPOSE
The purpose of this document is to define the requirements of credit card fraud
detection. In detail, this document provides a general description of our
project, including user requirements, product perspective, an overview of
requirements, and general constraints. In addition, it provides the specific
requirements and functionality needed for this project, such as the interface,
functional requirements and performance requirements.
3.2.2. SCOPE
The scope of this SRS document persists for the entire life cycle of the project.
This document defines the final state of the software requirements agreed upon
by the customers and designers. At the end of project execution, all
functionality should be traceable from the SRS to the product. The document
describes the functionality, performance, constraints, interface and reliability
for the entire life cycle of the project.
3.2.3. OVERVIEW
The software requirement specification document for the system covers the
following two sections:
GENERAL DESCRIPTIONS:
SPECIFICATION REQUIREMENT:
This section describes both the functional and non-functional requirements
of the system. The functional requirement section defines the system's external
interface, general requirements, performance, design constraints, etc.
The credit card fraud detection system has been developed to alert customers
about fraud on their credit cards. After the payment process, each
transaction performed is verified as genuine or fraudulent, and false alerts
are minimized by applying the genetic algorithm.
3.2.4.1. PRODUCT FUNCTION
Customers are those who make the transaction through any means.
SYSTEM ANALYSIS
This chapter gives the information regarding analysis done for the
proposed system. System Analysis is done to capture the requirement of the
user of the proposed system. It also provides the information regarding the
existing system and also the need for the proposed system. The key features
of the proposed system and the requirement specifications of the proposed
system are discussed below.
Traditional detection methods mainly depend on database systems and the
education of customers, and are usually delayed, inaccurate and not timely.
Later, methods based on discriminant analysis and regression analysis became
widely used; these detect fraud through the credit ratings of cardholders and
their credit card transactions, but they are not efficient for large amounts of data.
The high losses due to fraud, together with the relation between loss and the
available limit, must be reduced. Fraud has to be detected in real time, and
the number of false alerts has to be minimized.
4.3 PROPOSED SYSTEM
SYSTEM DESIGN
The process of design involves "conceiving and planning out in the mind and
making a drawing, pattern or sketch". System design transforms a logical
representation of what a given system is required to do into a physical reality
during development. Important design factors such as reliability, response time,
throughput of the system, maintainability and expandability should be taken
into account. Design constraints such as cost, hardware limitations and standards
compliance should also be dealt with. The task of system design is to take
the description and associate with it a specific set of facilities (people,
machines for computing and other purposes, accommodation, etc.) to provide
complete specifications of a workable system.
The new system must provide for all of the essential data processing, and it
may also perform some of the tasks identified during analysis as
optional extras. It must work within the imposed constraints and show
improvement over the existing system. At the outset of design, a choice must
be made between the main approaches. 'Preliminary design' is concerned with
the identification, analysis and selection of the major design options
available for the development and implementation of a system. These
options are most readily distinguished in terms of the physical facilities to be
used for the processing, that is, who or what does the work.
LITERATURE REVIEW
There have also been efforts to approach the problem from a completely new angle,
such as attempts to improve the alert-feedback interaction in the case of
fraudulent transactions.
The artificial genetic algorithm, one of the approaches that shed new light on this
domain, countered fraud from a different direction.
It proved accurate in finding fraudulent transactions and minimizing the
number of false alerts, although it posed a classification problem with
variable misclassification costs.
METHODOLOGY
The approach this paper proposes uses machine learning
algorithms to detect anomalous activities, called outliers.
The basic rough architecture diagram can be represented with the following
figure:
Fig-1
When looked at in detail on a larger scale along with real life elements, the full
architecture diagram can be represented as follows:
Fig-2
First of all, we obtained our dataset from Kaggle, a data analysis website which
provides datasets.
Inside this dataset, there are 31 columns, of which 28 are named V1-V28
to protect sensitive data.
The other columns represent Time, Amount and Class. Time shows the time
gap between the first transaction and the current one; Amount is the amount
of money transacted; and Class is 0 for a valid transaction and 1 for a
fraudulent one.
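The Class column is heavily skewed toward 0. A hypothetical count (invented figures, not the real Kaggle data, which is loaded in the CODING section) shows how the fraud fraction is computed:

```python
# Illustrative only: assume a 0.5% fraud rate over 10,000 transactions.
sample_classes = [0] * 9950 + [1] * 50

n_fraud = sample_classes.count(1)
n_valid = sample_classes.count(0)
fraud_fraction = n_fraud / (n_fraud + n_valid)
print(fraud_fraction)   # 0.005
```

This extreme imbalance is what later makes raw accuracy a misleading metric for the classifiers.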
CODING
Furthermore, using metrics such as precision, recall, and F1-scores, we will investigate why the
classification accuracy for these algorithms can be misleading.
In addition, we will explore the use of data visualization techniques common in data science,
such as parameter histograms and correlation matrices, to gain a better understanding of the
underlying distribution of data in our data set. Let's get started!
import sys
import numpy
import pandas
import matplotlib
import seaborn
import scipy
print('Python: {}'.format(sys.version))
print('Numpy: {}'.format(numpy.__version__))
print('Pandas: {}'.format(pandas.__version__))
print('Matplotlib: {}'.format(matplotlib.__version__))
print('Seaborn: {}'.format(seaborn.__version__))
print('Scipy: {}'.format(scipy.__version__))
Python: 2.7.13 |Continuum Analytics, Inc.| (default, May 11 2017, 13:17:26)
[MSC v.1500 64 bit (AMD64)]
Numpy: 1.14.0
Pandas: 0.21.0
Matplotlib: 2.1.0
Seaborn: 0.8.1
Scipy: 1.0.0
In [2]:
# import the necessary packages
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
In [3]:
# Load the dataset from the csv file using pandas
data = pd.read_csv('creditcard.csv')
In [4]:
# Start exploring the dataset
print(data.columns)
Index([u'Time', u'V1', u'V2', u'V3', u'V4', u'V5', u'V6', u'V7', u'V8', u'V9',
       u'V10', u'V11', u'V12', u'V13', u'V14', u'V15', u'V16', u'V17', u'V18',
       u'V19', u'V20', u'V21', u'V22', u'V23', u'V24', u'V25', u'V26', u'V27',
       u'V28', u'Amount', u'Class'],
      dtype='object')
In [5]:
# Take a 10% sample of the data for faster testing, then print its shape
data = data.sample(frac=0.1, random_state=1)
print(data.shape)
print(data.describe())
std    47584.727034    1.994661    1.709050    1.522313    1.420003
min        0.000000  -40.470142  -63.344698  -31.813586   -5.266509
25%    53924.000000   -0.908809   -0.610322   -0.892884   -0.847370
50%    84551.000000    0.031139    0.051775    0.178943   -0.017692
75%   139392.000000    1.320048    0.792685    1.035197    0.737312
max   172784.000000    2.411499   17.418649    4.069865   16.715537
                 V5            V6            V7            V8            V9
count 28481.000000 28481.000000 28481.000000 28481.000000 28481.000000
mean -0.015666 0.003634 -0.008523 -0.003040 0.014536
std 1.395552 1.334985 1.237249 1.204102 1.098006
min -42.147898 -19.996349 -22.291962 -33.785407 -8.739670
25% -0.703986 -0.765807 -0.562033 -0.208445 -0.632488
50% -0.068037 -0.269071 0.028378 0.024696 -0.037100
75% 0.603574 0.398839 0.559428 0.326057 0.621093
max 28.762671 22.529298 36.677268 19.587773 8.141560
Class
count 28481.000000
mean 0.001720
std 0.041443
min 0.000000
25% 0.000000
50% 0.000000
75% 0.000000
max 1.000000
[8 rows x 31 columns]
In [6]:
# Plot histograms of each parameter
data.hist(figsize= (20, 20))
plt.show()
In [7]:
# Determine number of fraud cases in dataset
Fraud = data[data['Class'] == 1]
Valid = data[data['Class'] == 0]

outlier_fraction = len(Fraud) / float(len(Valid))
print(outlier_fraction)
In [8]:
# Correlation matrix
corrmat = data.corr()
fig = plt.figure(figsize=(12, 9))
sns.heatmap(corrmat, vmax=0.8, square=True)
plt.show()
In [9]:
# Get all the columns from the dataFrame
columns = data.columns.tolist()
# Filter out the column we will be predicting on
columns = [c for c in columns if c != "Class"]
target = "Class"

X = data[columns]
Y = data[target]
# Print shapes
print(X.shape)
print(Y.shape)
(28481, 30)
(28481L,)
3. Unsupervised Outlier Detection
Now that we have processed our data, we can begin deploying our machine learning algorithms.
We will use the following techniques:
The anomaly score of each sample is called Local Outlier Factor. It measures the local deviation
of density of a given sample with respect to its neighbors. It is local in that the anomaly score
depends on how isolated the object is with respect to the surrounding neighbourhood.
The Isolation Forest 'isolates' observations by randomly selecting a feature and then randomly
selecting a split value between the maximum and minimum values of the selected feature.
Since recursive partitioning can be represented by a tree structure, the number of splits
required to isolate a sample is equivalent to the path length from the root node to the
terminating node.
This path length, averaged over a forest of such random trees, is a measure of normality and our
decision function.
Random partitioning produces noticeably shorter paths for anomalies. Hence, when a forest of
random trees collectively produces shorter path lengths for particular samples, those
samples are highly likely to be anomalies.
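The path-length intuition can be sketched in one dimension with a toy re-implementation of the idea (this is not scikit-learn's IsolationForest; all data and parameters are invented):

```python
import random

random.seed(0)

def isolation_depth(x, data, max_depth=20):
    """Number of random splits needed to isolate point x within 1-D data."""
    lo, hi = min(data), max(data)
    pts = list(data)
    depth = 0
    while len(pts) > 1 and depth < max_depth:
        split = random.uniform(lo, hi)
        # keep only the points on the same side of the split as x
        pts = [p for p in pts if (p <= split) == (x <= split)]
        if x <= split:
            hi = split
        else:
            lo = split
        depth += 1
    return depth

normal = [random.gauss(0, 1) for _ in range(200)]
data = normal + [8.0]                    # 8.0 is an obvious outlier

d_out = sum(isolation_depth(8.0, data) for _ in range(30)) / 30
d_in = sum(isolation_depth(normal[0], data) for _ in range(30)) / 30
print("outlier avg depth:", d_out, " inlier avg depth:", d_in)
```

The outlier sits alone at the edge of the range, so a random split cuts it off almost immediately, while a point inside the dense cluster needs many more splits; this shorter average path length is exactly the anomaly signal the forest aggregates.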
In [11]:
from sklearn.metrics import classification_report, accuracy_score
from sklearn.ensemble import IsolationForest
from sklearn.neighbors import LocalOutlierFactor
In [15]:
# define the two outlier detectors (outlier_fraction comes from In [7])
classifiers = {
    "Isolation Forest": IsolationForest(contamination=outlier_fraction,
                                        random_state=1),
    "Local Outlier Factor": LocalOutlierFactor(n_neighbors=20,
                                               contamination=outlier_fraction)}

# Fit the models
plt.figure(figsize=(9, 7))
n_outliers = len(Fraud)

for clf_name, clf in classifiers.items():
    # fit the data and tag outliers
    if clf_name == "Local Outlier Factor":
        y_pred = clf.fit_predict(X)
        scores_pred = clf.negative_outlier_factor_
    else:
        clf.fit(X)
        scores_pred = clf.decision_function(X)
        y_pred = clf.predict(X)

    # map predictions to the Class convention: 0 = valid, 1 = fraud
    y_pred[y_pred == 1] = 0
    y_pred[y_pred == -1] = 1
    n_errors = (y_pred != Y).sum()

    print('{}: {}'.format(clf_name, n_errors))
    print(accuracy_score(Y, y_pred))
    print(classification_report(Y, y_pred))
Isolation Forest: 71
0.99750711000316
precision recall f1-score support
IMPLEMENTATION
This idea is difficult to implement in real life because it requires cooperation
from banks, which aren't willing to share information due to market
competition, and also due to legal reasons and the protection of their users' data.
“This technique was applied to a full application data set supplied by a German
bank in 2006. For banking confidentiality reasons, only a summary of the
results obtained is presented below. After applying this technique, the level 1
list encompasses a few cases but with a high probability of being fraudsters.
All individuals mentioned in this list had their cards closed to avoid any risk due
to their high-risk profile. The condition is more complex for the other list. The
level 2 list is still restricted adequately to be checked on a case-by-case basis.
Credit and collection officers considered that half of the cases in this list could
be considered as suspicious fraudulent behaviour. For the last list and the
largest, the work is equitably heavy. Less than a third of them are suspicious.
In order to maximize the time efficiency and the overhead charges, a possibility
is to include a new element in the query; this element can be the five first digits
of the phone numbers, the email address, and the password, for instance, those
new queries can be applied to the level 2 list and level 3 list.”.
RESULTS
The code prints out the number of false positives it detected and compares it
with the actual values. This is used to calculate the accuracy score and precision
of the algorithms.
The fraction of data we used for faster testing is 10% of the entire dataset. The
complete dataset is also used at the end and both the results are printed.
These results, along with the classification report for each algorithm, are given in
the output, where class 0 means the transaction was determined to be valid and
class 1 means it was determined to be fraudulent.
The result is matched against the actual class values to check for false positives.
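The gap between accuracy and precision on such imbalanced data can be reproduced with a toy baseline that never flags fraud (the counts below are illustrative, not the report's figures):

```python
# A baseline that labels every transaction valid scores near-perfect accuracy
# on imbalanced data yet identifies no fraud at all.
y_true = [1] * 2 + [0] * 998    # 2 frauds among 1000 transactions (invented)
y_pred = [0] * 1000             # "always valid" baseline

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
true_pos = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
flagged = sum(y_pred)
precision = true_pos / flagged if flagged else 0.0  # no flags -> define as 0

print(accuracy)    # 0.998
print(precision)   # 0.0
```

This is why the classification report's per-class precision, recall and F1-scores, rather than overall accuracy, are the meaningful numbers in the output above.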
Fig-3
Fig-4
CONCLUSION
Credit card fraud is without a doubt an act of criminal dishonesty. This article
has listed the most common methods of fraud along with their detection
methods and reviewed recent findings in this field. This paper has also
explained in detail how machine learning can be applied to get better results in
fraud detection, along with the algorithm used.
While the algorithm does reach over 99.6% accuracy, its precision remains at
only 28% when a tenth of the data set is taken into consideration. However,
when the entire dataset is fed into the algorithm, the precision rises to 33%.
This high accuracy is to be expected, given the huge imbalance between the
number of valid and the number of fraudulent transactions.
Since the entire dataset consists of only two days' transaction records, it is only a
fraction of the data that could be made available if this project were used on a
commercial scale. Being based on machine learning algorithms, the program
will only improve its efficiency over time as more data is fed into it.
FUTURE ENHANCEMENT
While we couldn't reach our goal of 100% accuracy in fraud detection, we did
end up creating a system that can, with enough time and data, get very close to
that goal. As with any such project, there is some room for improvement here.
The very nature of this project allows for multiple algorithms to be integrated
together as modules and their results can be combined to increase the accuracy
of the final result.
This model can further be improved with the addition of more algorithms into it.
However, the output of these algorithms needs to be in the same format as the
others. Once that condition is satisfied, the modules are easy to add as done in
the code. This provides a great degree of modularity and versatility to the
project.
REFERENCES
[1] John Richard D. Kho and Larry A. Vea, "Credit Card Fraud Detection Based on
Transaction Behaviour," Proc. of the 2017 IEEE Region 10 Conference (TENCON),
Malaysia, November 5-8, 2017.
[6] "Credit Card Fraud Detection: A Realistic Modeling and a Novel Learning
Strategy," IEEE Transactions on Neural Networks and Learning Systems, vol. 29,
no. 8, August 2018.
[7] Ishu Trivedi, Monika and Mrigya Mridushi, "Credit Card Fraud Detection,"
International Journal of Advanced Research in Computer and Communication
Engineering, vol. 5, issue 1, January 2016.
[8] David J. Weston, David J. Hand, M. Adams, C. Whitrow and Piotr Juszczak,
"Plastic Card Fraud Detection using Peer Group Analysis," Springer, 2008.