Submitted by
JEGADEESH.J
(Reg No.20CSEE12)
Credit card fraud has become a growing problem. Large financial losses have affected
individual cardholders as well as merchants and banks. Machine learning is considered
one of the most successful techniques for identifying fraud. This project, entitled
CREDIT CARD FRAUD DETECTION, reviews different fraud detection techniques based on
machine learning and compares them using performance measures such as accuracy,
precision and specificity. The project also proposes an FDS (Fraud Detection System)
which uses supervised Logistic Regression and Decision Tree models. With the proposed
system, the accuracy of detecting credit card fraud is increased. Further, the proposed
system uses a learning-to-rank approach to rank the alerts and also effectively
addresses the problem of concept drift in fraud detection.
CONTENT
DECLARATION
CERTIFICATE
ACKNOWLEDGEMENT
ABSTRACT
INTRODUCTION
1.3.1 PYTHON
1.3.2 ANACONDA
2.1 EXISTING SYSTEM
2.2 PROPOSED SYSTEM
SELECTION OF THE ORGANIZATION
PROBLEM FORMULATION
4.1 MAIN OBJECTIVE
4.2 CONFIGURATION SUPPORT
4.3 INSTALLATION INSTRUCTIONS
5.1 FACT FINDING
5.2 FEASIBILITY ANALYSIS
5.3 INPUT DESIGN
5.4 OUTPUT DESIGN
5.5 MENU DESIGN
CHAPTER-I
INTRODUCTION
1. INTRODUCTION
Credit card fraud is a major problem in which payment cards such as credit cards are
used as an illegal source of funds in transactions. Fraud is an illegal way to obtain
goods and funds. The goal of such illegal transactions might be to get products
without paying or to draw unauthorized funds from an account. Identifying such fraud
is difficult, and undetected fraud puts businesses and organizations at risk. The
fraud detection system (FDS) monitors all approved transactions and raises alerts for
the most suspicious ones. An investigator verifies these alerts and gives the FDS
feedback on whether each transaction was authorized or fraudulent. Verifying every
alert every day is a time-consuming and costly process, so the investigator is able to
verify only a few alerts each day. The rest of the transactions remain unchecked until
the customer identifies them and reports them as fraud. In addition, the techniques
used for fraud and the spending behavior of cardholders change over time. This change
in credit card transactions is called concept drift, and it makes credit card fraud
difficult to identify. Machine learning is considered one of the most successful
techniques for fraud identification. It uses classification and regression approaches
to recognize fraud in credit cards. Many learning algorithms have been presented for
credit card fraud detection, including logistic regression and decision trees. This
project examines the performance of these two algorithms based on their ability to
classify a transaction as authorized or fraudulent, and then compares them. The
comparison is made using the performance measures accuracy, specificity and precision.
The results showed that the logistic regression and decision tree algorithms achieved
better accuracy and precision than other techniques.
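The three performance measures named above can be computed from the counts of a
confusion matrix. The following is a minimal sketch, assuming binary labels where 1
means fraud and 0 means authorized; the function name and the sample labels are
illustrative and are not taken from the project's code or data.

```python
def classification_metrics(y_true, y_pred):
    """Return accuracy, precision and specificity for binary labels (1 = fraud)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # fraud caught
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # legitimate passed
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false alert
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # fraud missed
    total = tp + tn + fp + fn
    return {
        "accuracy": (tp + tn) / total if total else 0.0,
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
        "specificity": tn / (tn + fp) if (tn + fp) else 0.0,
    }


# Hypothetical labels for ten transactions (not real results).
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 0, 1]
y_pred = [0, 0, 1, 0, 0, 0, 1, 0, 0, 1]
metrics = classification_metrics(y_true, y_pred)
print(metrics)
```

Because fraud is rare, accuracy alone can look high even when most fraud is missed,
which is one reason to report precision and specificity alongside it.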
1.3.1 PYTHON
PORTABLE
Python can run on a wide variety of hardware platforms and has the same interface
on all platforms.
EXTENDABLE
It allows adding low-level modules to the Python interpreter. These modules enable
programmers to add to or customize their tools to be more efficient.
DATABASES
Python provides interfaces to all major commercial databases.
GUI Programming
Python supports GUI applications that can be created and ported to many system
calls, libraries and windows systems, such as Windows MFC, Macintosh, and the X
Window system of Unix.
SCALABLE
Python provides a better structure and support for large programs than shell
scripting.
OBJECT-ORIENTED APPROACH
One of the key aspects of Python is its object-oriented approach. Python supports
classes and object encapsulation, which helps keep programs organized and maintainable
as they grow.
HIGHLY DYNAMIC
Python is a dynamically typed language. There is no need to declare the type of a
variable during coding, because types are determined at run time, which saves
development time.
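A short illustration of dynamic typing follows. The variable and function names are
made up for the example; the point is only that the same name can be bound to values
of different types without any declaration.

```python
def describe(value):
    """Report the run-time type of whatever value is passed in."""
    return f"{value!r} is of type {type(value).__name__}"


amount = 250             # bound to an int, no declaration needed
print(describe(amount))
amount = 250.75          # the same name now refers to a float
print(describe(amount))
amount = "250.75 USD"    # and now to a str
print(describe(amount))
```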
EXTENSIVE ARRAY OF LIBRARIES
Python ships with a large standard library whose modules can be imported wherever a
program needs them.
OPEN SOURCE AND FREE
Python is an open-source programming language which means that anyone can
create and contribute to its development. Python is free to download and use in any
operating system, like Windows, Mac or Linux.
1.3.2 ANACONDA
Anaconda is a free and open-source distribution of the Python and R
programming languages for scientific computing (data science, machine learning
applications, large-scale data processing, predictive analytics, etc.) that aims to
simplify package management and deployment. Package versions are managed by the
package management system conda. The Anaconda distribution includes data-science
packages suitable for Windows, Linux, and macOS.
Anaconda Navigator is a desktop graphical user interface (GUI) included in
Anaconda distribution that allows users to launch applications and manage conda
packages, environments and channels without using command-line commands.
Navigator can search for packages on Anaconda Cloud or in a local Anaconda
Repository, install them in an environment, run the packages and update them. It is
available for Windows, Mac OS and Linux.
1.3.3 JUPYTER NOTEBOOK
"Jupyter" is a loose acronym meaning Julia, Python, and R. These programming
languages were the first target languages of the Jupyter application.
As a server-client application, the Jupyter Notebook App allows you to edit and run
your notebooks via a web browser. The application can be executed on a PC without
Internet access, or it can be installed on a remote server and accessed through the
Internet. A kernel is a program that runs and introspects the user's code. The Jupyter
Notebook App has a kernel for Python code. "Notebook" or "Notebook documents" denote
documents that contain both code and rich text elements, such as figures, links and
equations. Because of the mix of code and text elements, these documents are the ideal
place to bring together an analysis description and its results, and they can be
executed to perform the data analysis in real time.
CHAPTER- II
BACKGROUND STUDY
2.1 EXISTING SYSTEM
In the existing system, large financial losses have affected individual credit
card users as well as merchants and banks. An unauthorized person could easily make a
transaction in place of the authorized cardholder.
The goal of such illegal transactions might be to get products without paying
or to draw unauthorized funds from an account.
CHAPTER-III
3.1 MAIN OBJECTIVE
The use of credit cards to perform financial transactions at banks or other
institutions is common with currently available technology. Online payments (and other
online transactions) benefit companies and individuals in terms of the convenience,
speed, and flexibility of performing daily tasks. A statistical analysis of credit
card usage over five years reflected the heavy dependence on credit cards by both
people and organizations. Companies try to use advanced techniques to provide
high-quality services to customers. Automation can be seen as the best way to attract
more customers and consequently increase revenue. However, the process of converting a
manual system to a fully automatic one, as found in smart cities, is not without risk.
3.2 METHODOLOGY
We gathered the data from the Kaggle website. The models are trained and tested
on this dataset using the following techniques: logistic regression, random forests
with decision trees, XGBoost and isolation forest, and the results are evaluated with
a confusion matrix. If our algorithm is applied in bank credit card fraud detection
systems, the probability that a transaction is fraudulent can be predicted soon after
the transaction occurs. A series of anti-fraud strategies can then be adopted to
protect banks from large losses and to reduce risk.
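To make the idea of predicting a fraud probability concrete, here is a minimal sketch
of logistic regression trained by batch gradient descent using only the Python
standard library. It is not the project's implementation: the two features, the
synthetic data, the learning rate and the function names are all assumptions made for
illustration, and the Kaggle dataset is not used.

```python
import math
import random


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def train_logistic_regression(X, y, lr=0.1, epochs=1000):
    """Fit weights and bias by minimizing log loss with batch gradient descent."""
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * d, 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) - yi
            for j in range(d):
                grad_w[j] += err * xi[j]
            grad_b += err
        w = [wj - lr * gj / n for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def fraud_probability(w, b, x):
    """Predicted probability that transaction x is fraudulent."""
    return sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b)


# Synthetic, imbalanced toy data with two made-up scaled features.
random.seed(0)
legit = [[random.gauss(0.0, 0.5), random.gauss(0.0, 0.5)] for _ in range(80)]
fraud = [[random.gauss(2.5, 0.5), random.gauss(2.5, 0.5)] for _ in range(20)]
X, y = legit + fraud, [0] * 80 + [1] * 20

w, b = train_logistic_regression(X, y)
p_normal = fraud_probability(w, b, [0.0, 0.0])
p_suspicious = fraud_probability(w, b, [3.0, 3.0])
print(p_normal, p_suspicious)
```

In a deployed system the probability would be compared against a threshold chosen
from the confusion matrix, so that the number of alerts matches what investigators
can review.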
SOFTWARE CONFIGURATION
CHAPTER-IV
SYSTEM ANALYSIS AND DESIGN
4.1 FEASIBILITY STUDY
● TECHNICAL FEASIBILITY
● OPERATIONAL FEASIBILITY
TECHNICAL FEASIBILITY
This phase focuses on the technical resources available to the organization. It helps
organizations determine whether the technical resources meet capacity and whether the
ideas can be converted into a working system model. Technical feasibility also
involves the evaluation of the hardware, software, and other technical requirements of
the proposed system.
OPERATIONAL FEASIBILITY
This phase involves undertaking a study to analyse and determine how well the
organization’s needs can be met by completing the project. Operational feasibility
study also examines how a project plan satisfies the requirements that are needed for
the phase of system development.
[Figure: system design diagram. Recoverable labels: Data Virtualization, Data
Modeling, Decision Tree, Logistic Regression, Feedback.]
CHAPTER-V
SYSTEM DEVELOPMENT
5. DESCRIPTION OF MODULES
● DATA PREPROCESSING
● DATA EXPLORATION
● DATA VIRTUALIZATION
● DATA MODELING
5.1 DATA PREPROCESSING
In this module the selected data is formatted, cleaned and sampled. The data
preprocessing steps include the following:
● Formatting: The selected data may not be in a suitable format. The data may be
in a file format when a relational database is preferred, or vice versa.
● Cleaning: Removing or fixing missing data is called cleaning. The dataset
may contain records that are incomplete or have null values. Such records need
to be removed.
● Sampling: Because the number of frauds in the dataset is far smaller than the
total number of transactions, the class distribution of credit card transactions
is unbalanced. A sampling method is used to address this issue.
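The cleaning and sampling steps above can be sketched as follows. The source does not
say which sampling method was used, so random undersampling of the majority class is
assumed here as one common option; the record layout and the field names `amount` and
`is_fraud` are hypothetical.

```python
import random


def clean(records):
    """Drop records that contain a missing (None) value."""
    return [r for r in records if all(v is not None for v in r.values())]


def undersample(records, label_key="is_fraud", seed=42):
    """Randomly keep as many legitimate records as there are fraud records."""
    rng = random.Random(seed)
    fraud = [r for r in records if r[label_key] == 1]
    legit = [r for r in records if r[label_key] == 0]
    kept_legit = rng.sample(legit, min(len(legit), len(fraud)))
    balanced = fraud + kept_legit
    rng.shuffle(balanced)
    return balanced


# Hypothetical records: 18 legitimate, 2 fraudulent, 1 incomplete.
records = [{"amount": float(i), "is_fraud": 0} for i in range(1, 19)]
records += [{"amount": 900.0, "is_fraud": 1}, {"amount": 1200.0, "is_fraud": 1}]
records.append({"amount": None, "is_fraud": 0})

cleaned = clean(records)
balanced = undersample(cleaned)
print(len(records), len(cleaned), len(balanced))
```

Undersampling discards legitimate transactions, so it trades information for balance;
oversampling the fraud class is an alternative with the opposite trade-off.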
5.2 DATA EXPLORATION
In the data modeling module, supervised machine learning algorithms are used to
classify transactions as fraudulent or legitimate. Logistic regression and decision
tree, the algorithms proposed in this project, are used for this purpose. The user
provides the ML algorithm with a dataset that includes the desired inputs and outputs,
and the algorithm finds a method to determine how to arrive at those results.
5.5 PREDICTION
The Credit Card Fraud Detection Problem includes modeling past credit
card transactions with the knowledge of the ones that turned out to be fraud. This
model is then used to identify whether a new transaction is fraudulent or not.
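The two stages described here, fitting a model on past labelled transactions and then
applying it to a new one, can be illustrated with the simplest possible decision
tree: a single split (a "stump") on one feature, chosen by Gini impurity. This is a
simplification for illustration only; the single `amount` feature, the sample values
and the function names are assumptions, not the project's model.

```python
def gini(labels):
    """Gini impurity of a list of 0/1 labels."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2.0 * p * (1.0 - p)


def split(amounts, labels, threshold):
    """Partition labels by whether the amount is at most the threshold."""
    left = [l for a, l in zip(amounts, labels) if a <= threshold]
    right = [l for a, l in zip(amounts, labels) if a > threshold]
    return left, right


def majority(labels):
    """Return 1 if more than half of the labels are 1, else 0."""
    return int(2 * sum(labels) > len(labels))


def fit_stump(amounts, labels):
    """Pick the threshold with the lowest weighted Gini impurity."""
    values = sorted(set(amounts))
    if len(values) < 2:
        raise ValueError("need at least two distinct amounts to split")
    best_threshold, best_score = None, float("inf")
    for lo, hi in zip(values, values[1:]):
        threshold = (lo + hi) / 2.0
        left, right = split(amounts, labels, threshold)
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
        if score < best_score:
            best_threshold, best_score = threshold, score
    left, right = split(amounts, labels, best_threshold)
    return best_threshold, majority(left), majority(right)


def predict(stump, amount):
    """Classify a new transaction amount with a fitted stump."""
    threshold, left_label, right_label = stump
    return left_label if amount <= threshold else right_label


# Hypothetical past transactions with known outcomes (1 = fraud).
past_amounts = [12.0, 25.0, 40.0, 55.0, 60.0, 980.0, 1500.0, 2100.0]
past_labels = [0, 0, 0, 0, 0, 1, 1, 1]
stump = fit_stump(past_amounts, past_labels)
print(stump, predict(stump, 30.0), predict(stump, 1800.0))
```

A full decision tree repeats this split search recursively on each side and over
every available feature.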
CHAPTER – VI
SYSTEM TESTING
System testing is the stage of implementation that is aimed at ensuring that the
system works accurately and efficiently before live operation commences. Testing is
vital to the success of the system. System testing makes the logical assumption that if
all the parts of the system are correct, then the goal will be successfully achieved.
System testing involves user training, system testing and successful running of the
developed system. The user tests the developed system and changes are made according
to their needs. The testing phase involves testing the developed system using various
kinds of data. While testing, errors are noted and corrections are made. The
corrections are also recorded for future use.
Unit testing focuses verification effort on the smallest unit of software design,
the software component or module. Using the component-level design description as a
guide, important control paths are tested to uncover errors within the boundary of the
module. The relative complexity of the tests and the errors they uncover is limited by
the constrained scope established for unit testing. The unit test focuses on the
internal processing logic and data structures within the boundaries of a component. It
is normally considered an adjunct to the coding step. The design of unit tests can be
performed before coding begins.
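As an illustration of a unit test, the sketch below uses Python's standard `unittest`
module on one small function. The function `is_high_value` and its limit are
hypothetical stand-ins for a component of the system, not code from the project.

```python
import unittest


def is_high_value(amount, limit=1000.0):
    """Hypothetical unit under test: flag transactions above a limit."""
    if amount < 0:
        raise ValueError("amount must be non-negative")
    return amount > limit


class IsHighValueTest(unittest.TestCase):
    def test_below_limit(self):
        self.assertFalse(is_high_value(50.0))

    def test_above_limit(self):
        self.assertTrue(is_high_value(2500.0))

    def test_boundary_is_not_flagged(self):
        self.assertFalse(is_high_value(1000.0))

    def test_negative_amount_rejected(self):
        with self.assertRaises(ValueError):
            is_high_value(-1.0)


# Run the tests programmatically so the outcome can be inspected.
suite = unittest.defaultTestLoader.loadTestsFromTestCase(IsHighValueTest)
result = unittest.TextTestRunner(verbosity=0).run(suite)
```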
Black box testing, also called behavioral testing, focuses on the functional
requirements of the software. It enables us to derive sets of input conditions that
exercise all functional requirements of a program. This technique focuses on the
information domain of the software, deriving test cases by partitioning the input and
output domains of a program.
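Partitioning the input domain can be sketched as a table of equivalence classes, each
with one representative input and the output the requirements call for. The tester
uses only inputs and outputs, never the function's internals. The function
`classify_amount` and its categories are hypothetical.

```python
def classify_amount(amount):
    """Hypothetical program under test, treated as a black box by the tester."""
    if amount < 0:
        return "invalid"
    if amount <= 1000.0:
        return "normal"
    return "review"


# (partition name, representative input, expected output)
partitions = [
    ("negative amounts", -5.0, "invalid"),
    ("lower boundary", 0.0, "normal"),
    ("typical amounts", 250.0, "normal"),
    ("upper boundary", 1000.0, "normal"),
    ("large amounts", 5000.0, "review"),
]

# Names of partitions whose observed output differs from the expected one.
failures = [
    name for name, value, expected in partitions
    if classify_amount(value) != expected
]
print(failures)
```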
6.4 WHITE BOX TESTING
White box testing, also called glass box testing, is a test case design method that
uses the control structures described as part of the component-level design to derive
test cases. The test cases are derived to ensure that all statements in the program
have been executed at least once during testing and that all logical conditions have
been exercised.
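In contrast to black box testing, white box test cases are chosen by reading the
code. The sketch below uses a hypothetical function with two nested conditions and
selects one input per execution path, so that every statement runs and each condition
is evaluated as both true and false.

```python
def risk_level(amount, is_foreign):
    """Hypothetical function with two conditions and three execution paths."""
    if amount > 1000.0:
        if is_foreign:
            return "high"
        return "medium"
    return "low"


# One test case per path through the control structure.
cases = [
    ((1500.0, True), "high"),     # outer condition true, inner condition true
    ((1500.0, False), "medium"),  # outer condition true, inner condition false
    ((20.0, False), "low"),       # outer condition false
]
results = [risk_level(*args) == expected for args, expected in cases]
print(results)
```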
CHAPTER-VII
CONCLUSION & FUTURE ENHANCEMENT
CONCLUSION
Decision tree and Logistic Regression algorithms were used in developing four
fraud detection models to classify a transaction as fraudulent or legitimate. The
decision tree algorithm produced predictions with an increased accuracy rate. The
Logistic Regression algorithm fits a statistical model and shows its best results when
the relationship between the independent variables and the dependent variable is
approximately linear. The results also showed that no data mining technique is
universally better than the others. Performance could be improved by developing a
fraud detection model that combines different data mining techniques.
FUTURE ENHANCEMENT
Advances in technology give criminals increasingly powerful tools to commit
fraud, especially using credit cards or internet bots. To combat the evolving face of
fraud, researchers are developing increasingly sophisticated tools, with algorithms and
data structures capable of handling large-scale complex data analysis and storage. This
system provides most of the essential features required to distinguish fraudulent from
legitimate transactions. As technology changes, it becomes more difficult to track the
behavior and patterns of fraudulent transactions. In future work, the accuracy of
detecting credit card fraud can be increased further. Further, the proposed system
uses a learning-to-rank approach to rank the alerts and also effectively addresses the
problem of concept drift in fraud detection.
SCREENSHOTS
IMPORTING REQUIRED LIBRARIES
FEATURE EXTRACTION
CORRELATION VALUES