
YAWN DETECTION SYSTEM

A PROJECT REPORT

Submitted by
DEEPIKA MOHANTY (1701348146)
MONALISHA MISHRA (1701348176)
RITIK BISWAL (1701348180)
SAGAR MALIK (1701348183)
SANDEEP KUMAR DEY (1701348187)

In partial fulfilment for the award of the degree


Of

BACHELOR OF TECHNOLOGY
IN

COMPUTER SCIENCE AND ENGINEERING

GANDHI INSTITUTE OF EXCELLENT TECHNOCRATS


BHUBANESWAR

BIJU PATNAIK UNIVERSITY OF TECHNOLOGY: ODISHA


2017-2021

BIJU PATNAIK UNIVERSITY OF TECHNOLOGY

BONAFIDE CERTIFICATE

Certified that this project report “CREDIT CARD FRAUD DETECTION” is the bonafide
work of “Deepika Mohanty, Monalisa Mishra, Ritik Biswal, Sagar Malik, Sandeep
Kumar Dey” who carried out the project work under my supervision.

SIGNATURE SIGNATURE SIGNATURE

Prof. Nilamadhab Mishra Prof. Chandan Kumar Panda Prof. Chandan Kumar Panda

HEAD OF THE DEPARTMENT COORDINATOR SUPERVISOR

Department of Computer Science & Engg.


Gandhi Institute of Excellent Technocrats
Ghangapatna, Khurda, Bhubaneswar
www.gietbbsr.edu.in

CERTIFICATE OF EVALUATION

COLLEGE NAME: GANDHI INSTITUTE OF EXCELLENT TECHNOCRATS


BRANCH: COMPUTER SCIENCE AND ENGINEERING
SEMESTER: 8

Sl. No   Registration No.   Name of the Student    Title of the Project

1        1701348146         Deepika Mohanty        Yawn Detection System
2        1701348176         Monalisa Mishra        Yawn Detection System
3        1701348180         Ritik Biswal           Yawn Detection System
4        1701348183         Sagar Malik            Yawn Detection System
5        1701348187         Sandeep Kumar Dey      Yawn Detection System

The project reports submitted by the above students, in partial fulfilment for the award of the degree of Bachelor
of Technology in Computer Science and Engineering of Biju Patnaik University of Technology, were evaluated
and confirmed as reports of the work done by them.

Submitted for the University Examination held on _________________________.

INTERNAL EXAMINER EXTERNAL EXAMINER

TABLE OF CONTENTS

CHAPTER NO.  TITLE

             ABSTRACT

1.           INTRODUCTION

2.           PROBLEM DEFINITION AND FEASIBILITY ANALYSIS
             2.1 PROBLEM DEFINITION
             2.2 FEASIBILITY ANALYSIS

3.           SOFTWARE REQUIREMENT SPECIFICATIONS
             3.1 INTRODUCTION
             3.2 REQUIREMENT ANALYSIS

4.           SYSTEM ANALYSIS
             4.1 EXISTING SYSTEM
             4.2 PROBLEM RECOGNITION
             4.3 PROPOSED SYSTEM

5.           SYSTEM DESIGN
             5.1 ARCHITECTURAL DESIGN

6.           LITERATURE REVIEW

7.           METHODOLOGY

8.           CODING

9.           IMPLEMENTATION

10.          RESULT

11.          CONCLUSION

12.          FUTURE ENHANCEMENTS

13.          REFERENCES

LIST OF FIGURES

S.No.  Title

1.     FIG-1
2.     FIG-2
3.     FIG-3
4.     FIG-4

ACKNOWLEDGEMENT

It is my pleasure to be indebted to the various people who, directly or indirectly, contributed to the
development of this work and who influenced my thinking, behaviour and actions during the course of study.

I also take this opportunity to express a deep sense of gratitude to Er. Rama Narayan Sabat,
Vice-Chairman, and Prof. Satya Prakash Das, Vice-Principal, GIET, Ghangapatana, Bhubaneswar, for their
cordial support, valuable information and guidance, which helped me in completing this task
through its various stages.

I wish to express my profound and sincere gratitude to Prof. Chandan Kumar Panda, Project
Coordinator, Department of Computer Science & Engineering, who guided me through the intricacies of
this project with matchless magnanimity.

I am thankful to Prof. Nilamadhab Mishra, Head of the Department of Computer Science &
Engineering, for his support, cooperation and motivation during the training, and for his constant
inspiration, presence and blessings.

I also extend my sincere appreciation to the faculty members who provided valuable suggestions and
precious time in accomplishing my major project report.

Lastly, I would like to thank the Almighty and my parents for their moral support, and my friends, with
whom I shared my day-to-day experiences and received many suggestions that improved the
quality of my work.

Deepika Mohanty (1701348146)


Monalisa Mishra (1701348176)
Ritik Biswal (1701348180)
Sagar Malik (1701348183)
Sandeep Kumar Dey (1701348187)

ABSTRACT

Yawning is caused by tiredness and fatigue, and driving while drowsy may lead to accidents. Such
accidents can be prevented by getting enough sleep before driving, drinking coffee or an energy
drink, or resting when the signs of drowsiness appear. Popular drowsiness detection
methods rely on complex measurements such as EEG and ECG. These methods achieve high
accuracy, but they require contact with the driver and have many limitations for monitoring
driver fatigue and drowsiness, so they are impractical during real-time driving. This project
proposes a way to detect the signs of drowsiness in drivers by measuring the eye closing rate
and yawning.

This project describes how to detect the eyes and mouth in a video recorded during an
experiment conducted by MIROS (Malaysian Institute of Road Safety). In the experiment, a
participant drives a driving simulator with a webcam placed in front of it. The webcam
records the participant's transition from awake to fatigued and, finally, drowsy. The designed
system first detects the face area in each image captured from the video; restricting attention
to the face area narrows the search for the eyes and mouth. Once the face is found, detection
regions are created for the left eye, the right eye and the mouth.

The parameters for eye and mouth detection are defined within the face image. The video is
converted into image frames, from which the eyes and mouth are located. Once the eyes are
located, measuring the intensity changes in the eye area determines whether the eyes are
open or closed.

If the eyes are found closed for four consecutive frames, the driver is confirmed to be in a
drowsy condition.

INTRODUCTION

Drowsiness is a state of near sleep, in which a person has a strong desire for sleep. It has
two distinct meanings, referring both to the usual state preceding falling asleep and to the
chronic condition of being in that state independent of the daily rhythm [16].
Sleepiness can be dangerous when performing tasks that require constant concentration,
such as driving a vehicle. When a person is sufficiently fatigued while driving, they
experience drowsiness, and this increases the risk of a road accident.

Figure 1: Road accident statistics from 2005 to 2009

Figure 1 shows road accident statistics in Malaysia from 2005 to 2009, provided by
MIROS (Malaysian Institute of Road Safety). The number of vehicles involved in
road accidents has kept increasing each year. From Figure 1, nearly 400,000 road
accident cases involving cars and taxis were recorded, the count rising every year
until, by 2009, the number of road accidents recorded by MIROS was nearly 500,000.

Figure 2 shows the difference between the fatigue and drowsiness conditions.

Figure 2: Examples of fatigue and drowsiness conditions which cause yawning

The development of technologies for detecting or preventing drowsiness while driving is a
major challenge in the field of accident avoidance systems. Because of the hazard that
drowsiness presents on the road, methods need to be developed for counteracting its
effects.

The aim of this project is to develop a simulation of a drowsiness detection system. The
focus is on designing a system that accurately monitors the open or closed state of the
driver's eyes and mouth. By monitoring the eyes, it is believed that the symptoms of
driver drowsiness can be detected at a sufficiently early stage to avoid a car accident.
Yawning detection is another way to assess the driver's fatigue: when people are fatigued,
they keep yawning to ensure that there is enough oxygen for the brain before entering
the drowsy state [17]. Detection of fatigue and drowsiness involves a sequence of face
images and observation of how long the eyes and mouth stay open or closed. A further
method of detecting eye closure is PERCLOS, which is based on the percentage of a
specific time interval during which the eyes are closed.
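As an illustration of the PERCLOS rule just described, the following minimal Python sketch computes the measure over a sliding window of per-frame eye states. The window length, frame rate and threshold are assumed values for illustration, not parameters taken from the experiment.

from collections import deque

def perclos(closed_flags):
    # Percentage of frames in the window during which the eyes were closed
    if not closed_flags:
        return 0.0
    return 100.0 * sum(closed_flags) / len(closed_flags)

window = deque(maxlen=900)    # assumed: 30 s of history at 30 frames/s

def update(eyes_closed, threshold=15.0):
    # Push one frame's eye state (True = closed); flag drowsiness when
    # PERCLOS exceeds the assumed threshold
    window.append(eyes_closed)
    return perclos(window) > threshold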

The analysis of face images is a popular research area, with applications such as face
recognition and human identification and tracking for security systems. This project
focuses on the localization of the eyes and mouth: it looks at the entire image of the
face and determines the positions of the eyes and mouth by applying existing image
processing algorithms. Once the eyes and mouth are located, the system determines
whether they are open or closed, and detects fatigue and drowsiness.
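The report does not name the particular detectors it applies. As one plausible realization of this pipeline, the sketch below uses OpenCV Haar cascades; the video file name is hypothetical, and haarcascade_smile.xml merely stands in for a mouth detector. The four-consecutive-frame rule from the abstract is applied at the end.

import cv2

# Cascades shipped with OpenCV; the smile cascade is only a stand-in mouth detector
face_c = cv2.CascadeClassifier(cv2.data.haarcascades + 'haarcascade_frontalface_default.xml')
eye_c = cv2.CascadeClassifier(cv2.data.haarcascades + 'haarcascade_eye.xml')
mouth_c = cv2.CascadeClassifier(cv2.data.haarcascades + 'haarcascade_smile.xml')

cap = cv2.VideoCapture('driving_simulator.avi')   # hypothetical recording
closed_streak = 0                                 # consecutive frames without open eyes

while True:
    ok, frame = cap.read()
    if not ok:
        break
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    for (x, y, w, h) in face_c.detectMultiScale(gray, 1.3, 5):
        roi = gray[y:y + h, x:x + w]                  # restrict search to the face
        eyes = eye_c.detectMultiScale(roi[:h // 2], 1.1, 10)     # upper face half
        mouth = mouth_c.detectMultiScale(roi[h // 2:], 1.5, 15)  # lower half, for yawn checks
        closed_streak = 0 if len(eyes) else closed_streak + 1
        if closed_streak >= 4:                        # the report's 4-frame rule
            print('Drowsiness detected')
cap.release()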

WHAT ARE FRAUDULENT TRANSACTIONS?

•  Fraudulent transactions are orders and purchases made using a credit card or bank account that does not belong to the buyer.

•  The purpose may be to obtain goods without paying, or to obtain unauthorized funds from an account.

•  As one of the largest factors in identity fraud, these transactions can end up damaging both merchants and the identity fraud victims.

•  Avoiding fraudulent transactions is in the interest of both merchants and buyers, so it is important to take proper precautions when managing money accounts.

WHAT IS FRAUD DETECTION?

Fraud detection involves monitoring the behaviour of users in order to estimate, detect, or
avoid undesirable behaviour. To counter credit card fraud effectively, it is necessary to
understand the technologies involved in detecting credit card fraud and to identify the
various types of credit card fraud.

A credit card is a small plastic card issued to users as a system of payment. It allows the
cardholder to buy goods and services based on the cardholder's promise to pay for them.
Credit card security relies on the physical security of the plastic card as well as the privacy
of the credit card number. Globalization and the increased use of the internet for online
shopping have resulted in a considerable proliferation of credit card transactions throughout
the world. This rapid growth in the number of credit card transactions has, in turn, led to a
substantial rise in fraudulent activities.

Credit card fraud is a wide-ranging term for theft and fraud committed using a credit card as
a fraudulent source of funds in a given transaction. Credit card fraudsters employ a large
number of techniques to commit fraud, so to combat credit card fraud effectively it is
important to first understand the mechanisms for identifying it. Over the years, credit card
fraud has stabilized considerably thanks to various detection and prevention mechanisms.

In recent years, data mining has become the prevailing approach, and credit card fraud
detection models have been built on it. However, because our problem is approached as a
classification problem with variable misclassification costs, classical data mining algorithms
are not directly applicable, so an alternative approach is taken using general-purpose
metaheuristics such as genetic algorithms.

This project proposes a credit card fraud detection system using a genetic algorithm.
Genetic algorithms are evolutionary algorithms which aim at obtaining better solutions as
time progresses. When a card is copied, stolen, or lost and captured by fraudsters, it is
usually used until its available limit is depleted. Thus, rather than the number of correctly
classified transactions, a solution which minimizes the total available limit on cards subject
to fraud is more important. The system aims at minimizing false alerts using a genetic
algorithm in which a set of interval-valued parameters is optimized.

PROBLEM DEFINITION AND FEASIBILITY ANALYSIS

2.1 PROBLEM DEFINITION

To develop a credit card fraud detection system using genetic algorithm. During the
credit card transaction, the fraud is detected and the number of false alerts is being minimized
by using genetic algorithm. Instead of maximizing the numbers of correctly classified
transactions we defined an objective function where the misclassification costs are variable
and thus, correct classification of some transactions are more important than correctly
classifying the others.

The algorithm begins with a population of randomly generated chromosomes.
These chromosomes undergo the operations of selection, crossover and mutation. Crossover
combines the information from two parent chromosomes to produce new individuals,
exploiting the best of the current generation, while mutation, by randomly changing some of
the parameters, allows exploration into other regions of the solution space. Natural selection
via a problem-specific cost function ensures that only the fittest chromosomes remain in the
population to mate and produce the next generation. Over successive iterations, the genetic
algorithm converges toward a near-optimal solution.
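To make the loop concrete, a minimal sketch of this procedure in Python follows. The population size, gene encoding and cost function here are illustrative placeholders; in the actual system the cost would reflect the available limit left exposed on misclassified cards, and the genes would be the interval-valued detection parameters.

import random

POP, GENES, GENERATIONS, MUT_RATE = 50, 8, 100, 0.05

def cost(chromosome):
    # Placeholder objective; the report's cost function would measure the
    # total available limit on cards wrongly classified, not this dummy sum
    return sum((g - 0.5) ** 2 for g in chromosome)

def crossover(a, b):
    # Combine information from two parent chromosomes
    cut = random.randrange(1, GENES)
    return a[:cut] + b[cut:]

def mutate(c):
    # Randomly change some parameters to explore new regions
    return [random.random() if random.random() < MUT_RATE else g for g in c]

population = [[random.random() for _ in range(GENES)] for _ in range(POP)]
for _ in range(GENERATIONS):
    population.sort(key=cost)            # natural selection: fittest first
    parents = population[:POP // 2]      # keep the fittest half
    children = [mutate(crossover(random.choice(parents), random.choice(parents)))
                for _ in range(POP - len(parents))]
    population = parents + children

print('best cost:', cost(min(population, key=cost)))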

2.1.1 LITERATURE SURVEY

Fraud detection has usually been seen as a data mining problem in which the objective is
to correctly classify transactions as legitimate or fraudulent. For classification problems,
many performance measures have been defined, most of which relate to the number of
cases classified correctly.

A more appropriate measure is needed due to the inherent structure of credit card
transactions. When a card is copied, stolen, or lost and captured by fraudsters, it is usually
used until its available limit is depleted. Thus, rather than the number of correctly classified
transactions, a solution which minimizes the total available limit on cards subject to fraud is
more important.

Since the fraud detection problem has mostly been defined as a classification problem, many
data mining algorithms, in addition to some statistical approaches, have been proposed to
solve it. Among these, decision trees and artificial neural networks are the most popular.
The study by Bolton and Hand provides a good summary of the literature on fraud
detection problems.

However, when the problem is approached as a classification problem with variable
misclassification costs, as discussed above, the classical data mining algorithms are not
directly applicable; either they must be modified, or new algorithms developed specifically
for this purpose are needed. An alternative approach is to make use of general-purpose
metaheuristics such as genetic algorithms.

2.2 FEASIBILITY ANALYSIS:

A feasibility analysis is an important tool to help assess the viability of starting a new
value-added business, or re-organizing or expanding an existing one.

All projects are feasible given unlimited resources and infinite time; unfortunately,
scarcity of resources and tight delivery dates plague all projects.

The following three kinds of feasibility are studied in the feasibility analysis of the project:

•  Operational feasibility

•  Technical feasibility

•  Economic feasibility

2.2.1. OPERATIONAL FEASIBILITY

The operational scope of the system is verified under operational feasibility. The
proposed system has enough operational reach to ensure the security of the
information; hence, the operational feasibility of the proposed system is found to be high.

The project runs in the familiar, user-friendly Windows environment, and the graphical
user interface, being today's de facto standard, has been exploited to give the user a
pleasant look and feel. Operational feasibility ensures that the project can be successfully
implemented; it can be used by anyone with basic internet knowledge. Hence we conclude
that this project is operationally feasible.

2.2.2. TECHNICAL FEASIBILITY

Technical feasibility checks the technical possibility of the system to be developed.
The necessary hardware and software resources to develop the system are readily available;
hence, the technical feasibility of the system is high. In this study, the technical
requirements of the proposed system are checked, along with the ability of the newly
developed project to work within the existing technical environment. Information regarding
upgrades in the technical aspects is gathered and weighed against the technical features of
the existing system.

If the technical features available in the existing system are suited to accommodate the
proposed system, the developed system is said to be technically feasible. As all the
technology required for this project is available in current browsers, this project is
technically feasible.

2.2.3. ECONOMIC FEASIBILITY

Economic analysis is the most frequently used method for evaluating the
effectiveness of a new system; it is more commonly known as cost/benefit
analysis. The software used in this project is freeware, so the cost of
developing the tool is minimal. The project requires only simple techniques
and minimal software, so it costs little and can be used in any environment.

SOFTWARE REQUIREMENTS SPECIFICATION

3.1. INTRODUCTION.

An SRS is basically an organization's understanding of a customer or potential
client's system and dependencies at a particular point in time, prior to any
actual design or development work. The software requirement specification has
been developed for future reference in case of any ambiguity or
misunderstanding. The SRS provides a detailed description of the requirements,
behaviour, constraints and performance of the system.

3.2. REQUIREMENT ANALYSIS

Requirement analysis is the transformation of operational needs into a software
description, software performance parameters and a software configuration,
through a standard, iterative process of analysis and trade-off studies:
understanding what the customer wants, analyzing needs, assessing feasibility,
negotiating a reasonable solution, validating the specification and managing the
requirements.

3.2.1. PURPOSE

The purpose of this document is to define the requirements of credit card fraud
detection. In detail, this document provides a general description of the
project, including user requirements, the product perspective, an overview of
requirements and general constraints. In addition, it provides the specific
requirements and functionality needed for this project, such as the interface,
functional requirements and performance requirements.

3.2.2. SCOPE

The scope of this SRS document persists for the entire life cycle of the project.
The document defines the final state of the software requirements agreed upon
by the customers and designers; at the end of project execution, all
functionalities should be traceable from the SRS to the product. The document
describes the functionality, performance, constraints, interface and reliability
for the entire life cycle of the project.

3.2.3. OVERVIEW

The software requirement specification document for the system covers the
following two sections:

GENERAL DESCRIPTION:

This section provides a general description of the project, including
descriptions of the product functions, user characteristics and general constraints.

SPECIFIC REQUIREMENTS:

This section describes both the functional and non-functional requirements
of the system. The functional requirements define the system's external
interfaces, general requirements, performance, design constraints, etc.

3.2.4. GENERAL DESCRIPTIONS

The credit card fraud detection system has been developed to alert customers
to fraudulent use of their credit cards. After the payment process, each
transaction performed is verified to determine whether it is genuine or
fraudulent, and false alerts are minimized by applying a genetic algorithm.

3.2.4.1. PRODUCT FUNCTION

The product is designed to provide reliable results, detect fraudulent
transactions effectively, and offer flexibility to the user in a secure and
accurate manner.

3.2.4.2. USER CHARACTERISTICS

The users of the system are classified as customers and administrators:

•  Customers are those who make transactions through any means.

•  Administrators monitor the transactions and report fraudulent usage.

3.2.4.3. GENERAL CONSTRAINTS

•  Hardware Limitations: There are no hardware limitations.

•  Interfaces to other Applications: There shall be no interfaces.

•  Parallel Operations: There are parallel operations.

•  Audit Functions: There shall be no audit functions.

•  Control Functions: There shall be no control functions.

SYSTEM ANALYSIS

This chapter gives information regarding the analysis done for the
proposed system. System analysis is carried out to capture the requirements
of the users of the proposed system. It also provides information about the
existing system and the need for the proposed one. The key features of the
proposed system and its requirement specifications are discussed below.

4.1 EXISTING SYSTEM

The traditional detection method depends mainly on database systems and the
education of customers, and is usually delayed, inaccurate and untimely.
Later, methods based on discriminant analysis and regression analysis came
into wide use; these detect fraud through credit ratings for cardholders and
credit card transactions, but they are not efficient for large amounts of data.

4.2 PROBLEM RECOGNITION

The heavy losses due to fraud, and the recognized relation between loss and
the available limit, have to be reduced. Fraud has to be detected in real time,
and the number of false alerts has to be minimized.

4.3 PROPOSED SYSTEM

The proposed system overcomes the above-mentioned issues in an efficient way.
Using a genetic algorithm, fraud is detected, false alerts are minimized, and
an optimized result is produced. Fraud is detected based on customer
behaviour. A new classification problem with variable misclassification costs
is introduced, and a genetic algorithm is applied in which a set of
interval-valued parameters is optimized.

SYSTEM DESIGN

The process of design involves "conceiving and planning out in the mind and
making a drawing, pattern or sketch". The system design transforms a logical
representation of what a given system is required to do into a physical reality
during development. Important design factors such as reliability, response time,
throughput, maintainability and expandability should be taken into account,
and design constraints like cost, hardware limitations and standards compliance
should also be dealt with. The task of system design is to take the description
and associate with it a specific set of facilities, namely people, machines
(computing and other), accommodation, etc., to provide a complete
specification of a workable system.

The new system must provide for all of the essential data processing, and it
may also perform some of the tasks identified during analysis as optional
extras. It must work within the imposed constraints and show improvement over
the existing system. At the outset of design, a choice must be made between
the main approaches. Preliminary design is concerned with the identification,
analysis and selection of the major design options available for the
development and implementation of the system; these options are most readily
distinguished in terms of the physical facilities to be used for the
processing, that is, who or what does the work.

LITERATURE REVIEW

Fraud is an unlawful or criminal deception intended to result in financial or
personal gain. It is a deliberate act against law, rule or policy, with the aim
of attaining unauthorized financial benefit.

A substantial body of literature on anomaly and fraud detection in this domain
has already been published and is available for public use. A comprehensive
survey conducted by Clifton Phua and his associates revealed that the
techniques employed in this domain include data mining applications, automated
fraud detection and adversarial detection. In another paper, Suman (Research
Scholar, GJUS&T, Hisar) presented techniques such as supervised and
unsupervised learning for credit card fraud detection. Even though these
methods and algorithms achieved unexpected success in some areas, they failed
to provide a permanent and consistent solution to fraud detection.

A similar research domain was presented by Wen-Fang Yu and Na Wang, who used
outlier mining, outlier detection mining and distance-sum algorithms to
accurately predict fraudulent transactions in an emulation experiment on the
credit card transaction data set of a certain commercial bank. Outlier mining
is a field of data mining used mainly in the monetary and internet fields. It
deals with detecting objects that are detached from the main system, i.e.
transactions that are not genuine. The authors took attributes of customer
behaviour and, based on the values of those attributes, calculated the distance
between the observed value of each attribute and its predetermined value.

Unconventional techniques, such as the hybrid data mining/complex network
classification algorithm, are able to perceive illegal instances in an actual
card transaction data set. Based on a network reconstruction algorithm that
creates representations of the deviation of one instance from a reference
group, they have proved efficient, typically on medium-sized online
transactions.

There have also been efforts to progress from a completely new angle:
attempts have been made to improve the alert-feedback interaction in the case
of fraudulent transactions. When a fraudulent transaction occurs, the
authorised system is alerted, and feedback is sent to deny the ongoing
transaction.

The artificial genetic algorithm, one of the approaches that shed new light on
this domain, countered fraud from a different direction. It proved accurate in
finding fraudulent transactions and in minimizing the number of false alerts,
even though it came with a classification problem involving variable
misclassification costs.

METHODOLOGY

The approach proposed in this report uses the latest machine learning
algorithms to detect anomalous activities, called outliers.

The basic rough architecture diagram can be represented with the following
figure:

Fig-1

When examined in more detail, with real-life elements included, the full
architecture diagram can be represented as follows:

Fig-2

First of all, we obtained our dataset from Kaggle, a data analysis website
which provides datasets.

Inside this dataset, there are 31 columns, of which 28 are named V1-V28 to
protect sensitive data.

The other columns represent Time, Amount and Class. Time shows the time
elapsed between each transaction and the first transaction in the dataset.
Amount is the amount of money transacted. Class 0 represents a valid
transaction and Class 1 represents a fraudulent one.

CODING

Credit Card Fraud Detection


Throughout the financial sector, machine learning algorithms are being developed to detect
fraudulent transactions, and that is exactly what we are going to do in this project.
Using a dataset of nearly 28,500 credit card transactions and multiple unsupervised anomaly
detection algorithms, we are going to identify transactions with a high probability of being credit
card fraud. We will build and deploy the following two machine learning algorithms:

•  Local Outlier Factor (LOF)

•  Isolation Forest Algorithm

Furthermore, using metrics such as precision, recall, and F1-scores, we will investigate why the
classification accuracy for these algorithms can be misleading.

In addition, we will explore the use of data visualization techniques common in data science,
such as parameter histograms and correlation matrices, to gain a better understanding of the
underlying distribution of data in our data set. Let's get started!

1. Importing Necessary Libraries


In [1]:

import sys
import numpy
import pandas
import matplotlib
import seaborn
import scipy

print('Python: {}'.format(sys.version))
print('Numpy: {}'.format(numpy.__version__))
print('Pandas: {}'.format(pandas.__version__))
print('Matplotlib: {}'.format(matplotlib.__version__))
print('Seaborn: {}'.format(seaborn.__version__))
print('Scipy: {}'.format(scipy.__version__))
Python: 2.7.13 |Continuum Analytics, Inc.| (default, May 11 2017, 13:17:26)
[MSC v.1500 64 bit (AMD64)]
Numpy: 1.14.0
Pandas: 0.21.0
Matplotlib: 2.1.0
Seaborn: 0.8.1
Scipy: 1.0.0

27
In [2]:
# import the necessary packages
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

2. The Data Set


In the following cells, we will import our dataset from a .csv file as a Pandas DataFrame.
Furthermore, we will begin exploring the dataset to gain an understanding of the type, quantity,
and distribution of data in our dataset. For this purpose, we will use Pandas' built-in describe
feature, as well as parameter histograms and a correlation matrix.

In [3]:
# Load the dataset from the csv file using pandas
data = pd.read_csv('creditcard.csv')

In [4]:
# Start exploring the dataset
print(data.columns)

Index([u'Time', u'V1', u'V2', u'V3', u'V4', u'V5', u'V6', u'V7', u'V8', u'V9',
       u'V10', u'V11', u'V12', u'V13', u'V14', u'V15', u'V16', u'V17', u'V18',
       u'V19', u'V20', u'V21', u'V22', u'V23', u'V24', u'V25', u'V26', u'V27',
       u'V28', u'Amount', u'Class'],
      dtype='object')

In [5]:
# Print the shape of the data
# (V1 - V28 are the results of a PCA dimensionality reduction to protect
# user identities and sensitive features)
data = data.sample(frac=0.1, random_state=1)
print(data.shape)
print(data.describe())

(28481, 31)
                Time            V1            V2            V3            V4
count   28481.000000  28481.000000  28481.000000  28481.000000  28481.000000
mean    94705.035216     -0.001143     -0.018290      0.000795      0.000350
std     47584.727034      1.994661      1.709050      1.522313      1.420003
min         0.000000    -40.470142    -63.344698    -31.813586     -5.266509
25%     53924.000000     -0.908809     -0.610322     -0.892884     -0.847370
50%     84551.000000      0.031139      0.051775      0.178943     -0.017692
75%    139392.000000      1.320048      0.792685      1.035197      0.737312
max    172784.000000      2.411499     17.418649      4.069865     16.715537

                 V5            V6            V7            V8            V9
count  28481.000000  28481.000000  28481.000000  28481.000000  28481.000000
mean      -0.015666      0.003634     -0.008523     -0.003040      0.014536
std        1.395552      1.334985      1.237249      1.204102      1.098006
min      -42.147898    -19.996349    -22.291962    -33.785407     -8.739670
25%       -0.703986     -0.765807     -0.562033     -0.208445     -0.632488
50%       -0.068037     -0.269071      0.028378      0.024696     -0.037100
75%        0.603574      0.398839      0.559428      0.326057      0.621093
max       28.762671     22.529298     36.677268     19.587773      8.141560

       ...           V21           V22           V23           V24
count  ...  28481.000000  28481.000000  28481.000000  28481.000000
mean   ...      0.004740      0.006719     -0.000494     -0.002626
std    ...      0.744743      0.728209      0.645945      0.603968
min    ...    -16.640785    -10.933144    -30.269720     -2.752263
25%    ...     -0.224842     -0.535877     -0.163047     -0.360582
50%    ...     -0.029075      0.014337     -0.012678      0.038383
75%    ...      0.189068      0.533936      0.148065      0.434851
max    ...     22.588989      6.090514     15.626067      3.944520

                V25           V26           V27           V28        Amount
count  28481.000000  28481.000000  28481.000000  28481.000000  28481.000000
mean      -0.000917      0.004762     -0.001689     -0.004154     89.957884
std        0.520679      0.488171      0.418304      0.321646    270.894630
min       -7.025783     -2.534330     -8.260909     -9.617915      0.000000
25%       -0.319611     -0.328476     -0.071712     -0.053379      5.980000
50%        0.015231     -0.049750      0.000914      0.010753     22.350000
75%        0.351466      0.253580      0.090329      0.076267     78.930000
max        5.541598      3.118588     11.135740     15.373170  19656.530000

              Class
count  28481.000000
mean       0.001720
std        0.041443
min        0.000000
25%        0.000000
50%        0.000000
75%        0.000000
max        1.000000

[8 rows x 31 columns]

In [6]:
# Plot histograms of each parameter
data.hist(figsize=(20, 20))
plt.show()

In [7]:
# Determine number of fraud cases in dataset
Fraud = data[data['Class'] == 1]
Valid = data[data['Class'] == 0]

outlier_fraction = len(Fraud) / float(len(Valid))
print(outlier_fraction)

print('Fraud Cases: {}'.format(len(data[data['Class'] == 1])))
print('Valid Transactions: {}'.format(len(data[data['Class'] == 0])))

0.00172341024198
Fraud Cases: 49
Valid Transactions: 28432

In [8]:
# Correlation matrix
corrmat = data.corr()
fig = plt.figure(figsize=(12, 9))

sns.heatmap(corrmat, vmax=.8, square=True)
plt.show()

In [9]:
# Get all the columns from the dataFrame
columns = data.columns.tolist()

# Filter the columns to remove data we do not want
columns = [c for c in columns if c not in ["Class"]]

# Store the variable we'll be predicting on
target = "Class"

X = data[columns]
Y = data[target]

# Print shapes
print(X.shape)
print(Y.shape)

(28481, 30)
(28481L,)

3. Unsupervised Outlier Detection
Now that we have processed our data, we can begin deploying our machine learning algorithms.
We will use the following techniques:

Local Outlier Factor (LOF)

The anomaly score of each sample is called the Local Outlier Factor. It measures the local
deviation of the density of a given sample with respect to its neighbors. It is local in that the
anomaly score depends on how isolated the object is with respect to the surrounding
neighbourhood.

Isolation Forest Algorithm

The IsolationForest 'isolates' observations by randomly selecting a feature and then randomly
selecting a split value between the maximum and minimum values of the selected feature.

Since recursive partitioning can be represented by a tree structure, the number of splits
required to isolate a sample is equivalent to the path length from the root node to the
terminating node.

This path length, averaged over a forest of such random trees, is a measure of normality and
serves as our decision function.

Random partitioning produces noticeably shorter paths for anomalies. Hence, when a forest of
random trees collectively produces shorter path lengths for particular samples, those samples
are highly likely to be anomalies.

In [11]:
from sklearn.metrics import classification_report, accuracy_score
from sklearn.ensemble import IsolationForest
from sklearn.neighbors import LocalOutlierFactor

# define random states
state = 1

# define outlier detection tools to be compared
classifiers = {
    "Isolation Forest": IsolationForest(max_samples=len(X),
                                        contamination=outlier_fraction,
                                        random_state=state),
    "Local Outlier Factor": LocalOutlierFactor(
        n_neighbors=20,
        contamination=outlier_fraction)}

In [15]:
# Fit the model
plt.figure(figsize=(9, 7))
n_outliers = len(Fraud)

for i, (clf_name, clf) in enumerate(classifiers.items()):

    # fit the data and tag outliers
    if clf_name == "Local Outlier Factor":
        y_pred = clf.fit_predict(X)
        scores_pred = clf.negative_outlier_factor_
    else:
        clf.fit(X)
        scores_pred = clf.decision_function(X)
        y_pred = clf.predict(X)

    # Reshape the prediction values to 0 for valid, 1 for fraud.
    y_pred[y_pred == 1] = 0
    y_pred[y_pred == -1] = 1

    n_errors = (y_pred != Y).sum()

    # Run classification metrics
    print('{}: {}'.format(clf_name, n_errors))
    print(accuracy_score(Y, y_pred))
    print(classification_report(Y, y_pred))
Local Outlier Factor: 97
0.9965942207085425
             precision    recall  f1-score   support

          0       1.00      1.00      1.00     28432
          1       0.02      0.02      0.02        49

avg / total       1.00      1.00      1.00     28481

Isolation Forest: 71
0.99750711000316
             precision    recall  f1-score   support

          0       1.00      1.00      1.00     28432
          1       0.28      0.29      0.28        49

avg / total       1.00      1.00      1.00     28481

IMPLEMENTATION
This idea is difficult to implement in real life because it requires cooperation
from banks, which are not willing to share information, both because of market
competition and for legal reasons related to the protection of their users' data.

Therefore, we looked up reference papers which followed similar approaches and
gathered their results. As stated in one of these reference papers:

“This technique was applied to a full application data set supplied by a German
bank in 2006. For banking confidentiality reasons, only a summary of the
results obtained is presented below. After applying this technique, the level 1
list encompasses a few cases but with a high probability of being fraudsters.

All individuals mentioned in this list had their cards closed to avoid any risk due
to their high-risk profile. The condition is more complex for the other list. The
level 2 list is still restricted adequately to be checked on a case-by-case basis.

Credit and collection officers considered that half of the cases in this list could
be considered as suspicious fraudulent behaviour. For the last list and the
largest, the work is equitably heavy. Less than a third of them are suspicious.

In order to maximize the time efficiency and the overhead charges, a possibility
is to include a new element in the query; this element can be the five first digits
of the phone numbers, the email address, and the password, for instance, those
new queries can be applied to the level 2 list and level 3 list.”.

RESULTS
The code prints out the number of errors it detected and compares the
predictions with the actual values. This is used to calculate the accuracy
score and precision of the algorithms.

For faster testing, we used 10% of the entire dataset. The complete dataset is
also used at the end, and both results are printed.

These results, along with the classification report for each algorithm, are
given in the output below, where class 0 means the transaction was determined
to be valid and class 1 means it was determined to be fraudulent.

The predictions are matched against the class values to check for false positives.

Results when 10% of the dataset is used:

Fig-3

Results when the complete dataset is used:
Fig-4

CONCLUSION
Credit card fraud is without a doubt an act of criminal dishonesty. This report
has listed the most common methods of fraud along with their detection methods,
and has reviewed recent findings in this field. It has also explained in detail
how machine learning can be applied to get better results in fraud detection,
together with the algorithm, pseudocode, an explanation of its implementation,
and the experimentation results.

While the algorithm does reach over 99.6% accuracy, its precision remains
only 28% when a tenth of the data set is taken into consideration. However,
when the entire dataset is fed into the algorithm, the precision rises to 33%.
This high accuracy is to be expected because of the huge imbalance between
the number of valid and the number of fraudulent transactions.
Since the entire dataset consists of only two days' transaction records, it is
only a fraction of the data that could be made available if this project were
used on a commercial scale. Being based on machine learning algorithms, the
program will only increase its efficiency over time as more data is fed into it.
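The effect of this class imbalance on raw accuracy can be checked directly from the figures reported in the coding chapter; the following back-of-the-envelope sketch uses the counts from the 10% sample:

# A classifier that labels every transaction "valid" already scores:
valid, fraud = 28432, 49                 # counts from the 10% sample above
baseline = valid / float(valid + fraud)
print('always-valid baseline accuracy: {:.4f}'.format(baseline))  # ~0.9983
# This is nearly the 99.6%+ the models report, which is why fraud-class
# precision and recall, not accuracy, are the meaningful metrics here.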

FUTURE ENHANCEMENT

While we could not reach our goal of 100% accuracy in fraud detection, we did
end up creating a system that can, with enough time and data, get very close to
that goal. As with any such project, there is some room for improvement here.

The very nature of this project allows for multiple algorithms to be integrated
together as modules and their results can be combined to increase the accuracy
of the final result.
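As a sketch of how such modules could be combined (an illustration, not part of the project code; the combiner shown is a simple majority vote over the 0/1 fraud flags produced in the coding chapter, and the model names are hypothetical):

import numpy as np

def combine(predictions):
    # Majority vote over per-model 0/1 fraud flags (one array per model)
    votes = np.vstack(predictions)
    return (votes.sum(axis=0) > votes.shape[0] / 2.0).astype(int)

# e.g. combined = combine([y_pred_lof, y_pred_iforest, y_pred_third_model])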

This model can further be improved by adding more algorithms to it. The output
of these algorithms needs to be in the same format as the others'; once that
condition is satisfied, the modules are easy to add, as done in the code. This
gives the project a great degree of modularity and versatility.

More room for improvement can be found in the dataset. As demonstrated before,
the precision of the algorithms increases when the size of the dataset is
increased. Hence, more data will surely make the model more accurate at
detecting frauds and reduce the number of false positives. However, this
requires official support from the banks themselves.

REFERENCES

[1] John Richard D. Kho and Larry A. Vea, "Credit Card Fraud Detection Based on
Transaction Behaviour," Proc. of the 2017 IEEE Region 10 Conference (TENCON),
Malaysia, November 5-8, 2017.

[2] Clifton Phua, Vincent Lee, Kate Smith and Ross Gayler, "A Comprehensive
Survey of Data Mining-based Fraud Detection Research," School of Business
Systems, Faculty of Information Technology, Monash University, Wellington Road,
Clayton, Victoria 3800, Australia.

[3] Suman (Research Scholar, GJUS&T Hisar, HCE Sonepat), "Survey Paper on
Credit Card Fraud Detection," International Journal of Advanced Research in
Computer Engineering & Technology (IJARCET), Volume 3, Issue 3, March 2014.

[4] Wen-Fang Yu and Na Wang, "Research on Credit Card Fraud Detection Model
Based on Distance Sum," 2009 International Joint Conference on Artificial
Intelligence.

[5] Massimiliano Zanin, Miguel Romance, Regino Criado and Santiago Moral,
"Credit Card Fraud Detection through Parenclitic Network Analysis," Hindawi
Complexity, Volume 2018, Article ID 5764370, 9 pages.

[6] "Credit Card Fraud Detection: A Realistic Modeling and a Novel Learning
Strategy," IEEE Transactions on Neural Networks and Learning Systems, Vol. 29,
No. 8, August 2018.

[7] Ishu Trivedi, Monika, Mrigya and Mridushi, "Credit Card Fraud Detection,"
International Journal of Advanced Research in Computer and Communication
Engineering, Vol. 5, Issue 1, January 2016.

[8] David J. Weston, David J. Hand, M. Adams, Whitrow and Piotr Juszczak,
"Plastic Card Fraud Detection using Peer Group Analysis," Springer, 2008.

