Batch 31
A PROJECT REPORT
Submitted by
NITHIN M (312417104061)
SABHARAM M (312417104083)
of
BACHELOR OF ENGINEERING
IN
COMPUTER SCIENCE AND ENGINEERING
CHENNAI - 119
APRIL-2021
ANNA UNIVERSITY: CHENNAI 600 025
BONAFIDE CERTIFICATE
SIGNATURE
Dr. J. DAFNI ROSE, M.E., Ph.D.,
Professor,
Head of the Department,
Computer Science and Engineering,
St. Joseph’s Institute of Technology,
Old Mamallapuram Road,
Chennai - 600 119.

SIGNATURE
Mrs. V. NISHA JENIPHER, M.E., (Ph.D.),
Assistant Professor,
Computer Science and Engineering,
St. Joseph’s Institute of Technology,
Old Mamallapuram Road,
Chennai - 600 119.
ACKNOWLEDGEMENT
We also take this opportunity to thank our honourable Chairman Dr. B. Babu
Manoharan, M.A., M.B.A., Ph.D. for the guidance he offered during our tenure in
this institution.
We express our deep gratitude to our honourable CEO Mr. B. Sashi Sekar,
M.Sc (INTL.Business) for the constant guidance and support for our project.
Our earnest gratitude to our Head of the Department Dr. J. Dafni Rose, M.E.,
Ph.D., for her commendable support and encouragement for the completion of the
project with perfection.
We express our profound gratitude to our guide Mrs. V. Nisha Jenipher, M.E.,
(Ph.D.), for her guidance, constant encouragement, immense help and valuable advice
for the completion of this project.
We wish to convey our sincere thanks to all the teaching and non-teaching
staff of the department of COMPUTER SCIENCE AND ENGINEERING without
whose co-operation this venture would not have been a success.
CERTIFICATE OF EVALUATION
Semester: VIII
The reports of the project work submitted by the above students in partial
Submitted for project review and viva voce exam held on ________________
ABSTRACT
Machine learning is the field of study that gives computers the ability to learn without
being explicitly programmed. It focuses on the development of computer programs that can
access data and use it to learn for themselves. A model is trained on one dataset and tested on
new data to obtain the desired result, and this process is repeated many times to improve the
learning. Machine learning enables computers to handle tasks that are otherwise done by humans;
it is the process of teaching a computer system to make accurate predictions when fed data. A
dataset of transactions from a European company was used to measure how accurately fraudulent
transactions can be identified. The existing system uses a machine learning algorithm called
logistic regression. This technique showed an accuracy of 99.962% on genuine data and 79.065%
on fraud data, and the accuracy of the algorithm was found to be around 50%. To improve on the
accuracy of the existing system, another algorithm, Random Forest, was taken into
consideration. This algorithm is stable: it handles large datasets and continues to give
reliable accuracy values even if new data is added or values are missing from the dataset. It
showed an accuracy of 99.988% on genuine data and 42.683% on fraud data. Various libraries are
imported, and the dataset is collected and loaded using the pandas library. The data is explored
to check whether all numerical values are defined, and data imbalance is checked and corrected
if needed. The data is split and plotted to calculate the percentage of fraudulent and valid
transactions, and the features that are relevant to each other for prediction are correlated.
The purpose of credit card fraud is to obtain goods without paying or to obtain unauthorized
funds from an account. An e-commerce website is used as an example to detect the number of
fraudulent transactions made on a customer’s account. Products that were paid for but not
bought by the customer are identified, and the corresponding transactions are recognized as
fraudulent.
TABLE OF CONTENTS
CHAPTER NO    TITLE
ABSTRACT
LIST OF TABLES
LIST OF FIGURES
1. INTRODUCTION
1.1 Overview
1.2 Problem Statement
1.3 Existing System
1.4 Proposed System
2. LITERATURE SURVEY
3. SYSTEM DESIGN
3.1 System Requirements
3.2 UML Flow Diagrams
3.2.1 Use Case Diagram of Credit card Fraud Detection
4. SYSTEM ARCHITECTURE
4.1 Architectural Design
4.2 List of Modules
4.2.1 Preprocessing module
4.2.2 Machine Learning Module
4.2.3 Data exploration Module
5. SYSTEM IMPLEMENTATION
5.1 System Description
5.2 Pseudo code for Random Forest Algorithm
5.4 System Accuracy
6. RESULTS AND CODING
6.1 Sample Code
6.1.1 ML Model Code
6.1.2 Dataset File
6.2 Screenshots
7. CONCLUSION AND FUTURE WORK
7.1 Conclusion
7.2 Future Work
References
LIST OF FIGURES
CHAPTER 1
INTRODUCTION
The aim is to analyze the given data set and identify the percentage of fraudulent
transactions in it. Nowadays, technology has advanced and fraud has been rising rapidly.
In the banking sector, fraudulent credit card activity has increased. In our
model, the main task is to make accurate predictions when data is fed to it. With the use
of machine learning, we analyze and summarize the frauds in credit card
transactions.
1.1 OVERVIEW
The Credit Card Fraud Detection Problem includes modeling past credit
card transactions with the knowledge of the ones that turned out to be fraud. This
model is then used to identify whether a new transaction is fraudulent or not. Our
aim here is to detect 100% of the fraudulent transactions while minimizing the
incorrect fraud classifications.
The main challenges are as follows. An enormous volume of data is processed
every day, so the model built must be fast enough to respond to a scam in time.
The data is imbalanced: most of the transactions (99.8%) are not fraudulent,
which makes the fraudulent ones really hard to detect. Data availability is
limited, as the data is mostly private. Misclassified data can be another major
issue, as not every fraudulent transaction is caught and reported. Finally,
scammers use adaptive techniques against the model.
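To make the effect of the imbalance concrete, the following minimal sketch uses a made-up sample of 1000 transactions with the 99.8% legitimate share mentioned above. The sample size and the "label everything as legitimate" model are assumptions chosen only for illustration; they are not part of our system.

```python
import numpy as np

# Hypothetical sample: 1000 transactions, 99.8% legitimate (0) and 0.2% fraud (1).
n_total = 1000
n_fraud = 2
y_true = np.array([0] * (n_total - n_fraud) + [1] * n_fraud)

# A naive "model" that labels every transaction as legitimate.
y_pred = np.zeros(n_total, dtype=int)

# Overall accuracy: share of transactions labelled correctly.
accuracy = float((y_pred == y_true).mean())

# Fraud recall: share of the actual frauds that the model caught.
fraud_recall = float((y_pred[y_true == 1] == 1).mean())

print("Accuracy: {:.3f}".format(accuracy))
print("Fraud recall: {:.3f}".format(fraud_recall))
```

Such a model reaches 99.8% accuracy while catching none of the frauds, which is why overall accuracy alone is a weak measure for this problem.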
1.3.1 INPUT DATA SET
The data set is based on real-life transactional data from a large European
company, and the personal details in the data are kept confidential. The accuracy
of the existing algorithm is around 50%.
A case study on credit card fraud detection, in which data normalization is
applied before cluster analysis, obtained results from cluster analysis and
artificial neural networks showing that neuronal inputs can be minimized by
clustering attributes.
Another study set out to find an algorithm that reduces the cost measure; the
result obtained was an improvement of 23%, and the algorithm found was Bayes
minimum risk.
In this proposed project we designed a model to detect fraudulent activity
in credit card transactions. This system is capable of providing most of the
essential features required to distinguish fraudulent from legitimate transactions.
Various libraries are imported, and the dataset is collected and loaded using
the pandas library. The data is explored to check whether all numerical values
are defined, and data imbalance is checked and corrected if needed. The data is
split and plotted to calculate the percentage of fraudulent and valid
transactions, and the features that are relevant to each other for prediction
are correlated.
The Random Forest algorithm has been found to provide a good estimate
of the generalization error and to be resistant to overfitting.
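The exploration steps described above (checking for undefined values, measuring the class imbalance and correlating the features) could be sketched roughly as follows. The column names `Amount`, `V1` and `Class`, the convention that label 1 means fraud, the helper name `explore` and the small table are all assumptions made for illustration; the real dataset would be loaded from its file with pandas instead.

```python
import pandas as pd

def explore(df, label_col="Class"):
    """Return basic exploration results for a labelled transaction table."""
    # Count missing values per column to confirm every numerical value is defined.
    missing = df.isnull().sum()
    # Percentage of fraudulent (label 1) and valid (label 0) transactions.
    fraud_pct = 100.0 * (df[label_col] == 1).mean()
    valid_pct = 100.0 - fraud_pct
    # Pairwise correlation between the numerical features and the label.
    corr = df.corr()
    return missing, fraud_pct, valid_pct, corr

# Tiny made-up table standing in for the real dataset.
data = pd.DataFrame({
    "Amount": [12.5, 300.0, 7.9, 45.0, 1200.0, 3.2, 18.0, 60.0, 950.0, 22.0],
    "V1":     [0.1, -1.2, 0.3, 0.0, -2.5, 0.2, 0.4, -0.1, -3.0, 0.5],
    "Class":  [0, 0, 0, 0, 1, 0, 0, 0, 1, 0],
})

missing, fraud_pct, valid_pct, corr = explore(data)
print("Missing values per column:\n{}".format(missing))
print("Fraud: {:.1f}%  Valid: {:.1f}%".format(fraud_pct, valid_pct))
print(corr["Class"].sort_values(ascending=False))
```

Sorting the correlations against the label column gives a quick view of which features move together with the fraud label.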
CHAPTER 2
LITERATURE SURVEY
An enormous volume of data is processed every day, so the model built must be
fast enough to respond to a scam in time. The data is imbalanced: most of the
transactions (99.8%) are not fraudulent, which makes the fraudulent ones really
hard to detect. Data availability is limited, as the data is mostly private.
Misclassified data can be another major issue, as not every fraudulent
transaction is caught and reported. Deep learning algorithms are used to learn
from the dataset automatically.
show that k-nearest neighbor performs better than naïve Bayes and logistic
regression techniques. Credit card fraud detection, which is a data mining
problem, becomes challenging for two major reasons: first, the profiles of
normal and fraudulent behavior change constantly, and second, credit card fraud
data sets are highly skewed.
Expected future areas of research include examining meta-classifiers and
meta-learning approaches for handling highly imbalanced credit card fraud data.
The effects of other sampling approaches can also be investigated.
time. This model can further be improved with the addition of more algorithms
into it. However, the output of these algorithms needs to be in the same format as
the others. Once that condition is satisfied, the modules are easy to add as done in
the code. This provides a great degree of modularity and versatility to the project.
CHAPTER 3
SYSTEM DESIGN
Use case diagrams are used for high-level requirement analysis of a
system. When the requirements of a system are analysed, the functionalities
are captured in use cases. Use cases are therefore the system functionalities
written in an organized manner. The second element relevant to use cases is
the actors. An actor can be defined as something that interacts with the
system; actors can be human users, internal applications or external
applications. Use case diagrams are used to gather the requirements of a
system, including internal and external influences. These requirements are
mostly design requirements. Hence, when a system is analyzed to gather
its functionalities, use cases are prepared and actors are identified.
UML sequence diagrams model the flow of logic within the system in
a visual manner, making it possible to both document and validate the logic, and
are commonly used for both analysis and design purposes.
The various actions that take place in the application, in the correct
sequence, are shown in the above figure. Sequence diagrams are the most popular
UML artifact for dynamic modeling.
3.2.3 Activity Diagram of Credit card Fraud Detection
3.2.4 Component Diagram of Credit card Fraud Detection
3.2.6 Deployment Diagram of Credit card Fraud Detection
3.2.7 Package Diagram of Credit card Fraud Detection
Package diagrams are used to reflect the organization of packages and
their elements. When used to represent class elements, package diagrams provide
a visualization of the namespaces.
Package diagrams are used to structure high-level system elements.
They can also be used to simplify complex class diagrams by grouping
classes into packages. A package is a collection of logically related UML
elements.
Packages are depicted as file folders and can be used on any of
the UML diagrams. The figure shows the package diagram for the developed
application, which represents how the elements are logically related.
CHAPTER 4
SYSTEM ARCHITECTURE
The Credit Card Fraud Detection Problem includes modeling past credit
card transactions with the knowledge of the ones that turned out to be fraud. This
model is then used to identify whether a new transaction is fraudulent or not. Our
aim here is to detect 100% of the fraudulent transactions while minimizing the
incorrect fraud classifications.
4.2 List of Modules
The first module imports the libraries needed to load and process the data:
numpy, pandas, matplotlib and seaborn.
Second, the dataset is loaded and stored in matrix form using numpy, and
manipulation of the dataset is done in pandas. This is called data analysis. Using
matplotlib, a bar chart is drawn to visualize the numerical values for each
transaction, and seaborn is used to make graphical representations of the data.
Next, the dataset is split into a training dataset and a test dataset before
the machine learning model is applied.
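The split step could look roughly like the sketch below. The data here is randomly generated for illustration. The `test_size` and `random_state` values follow those used in the sample code of Chapter 6, while `stratify=yData` is an added assumption that keeps the share of fraud cases equal in both parts.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up data: 1000 transactions with 3 features, 20 of them fraud (label 1).
rng = np.random.RandomState(42)
xData = rng.normal(size=(1000, 3))
yData = np.array([0] * 980 + [1] * 20)

# Hold out 20% of the rows for testing; stratify on the label so that the
# rare fraud class is represented proportionally in both parts.
xTrain, xTest, yTrain, yTest = train_test_split(
    xData, yData, test_size=0.2, random_state=42, stratify=yData)

print("Train size: {}, frauds: {}".format(len(yTrain), int(yTrain.sum())))
print("Test size: {}, frauds: {}".format(len(yTest), int(yTest.sum())))
```

Without stratification, a random split of highly skewed data can leave very few fraud cases in the test set, which makes the measured fraud accuracy unstable.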
CHAPTER 5
SYSTEM IMPLEMENTATION
In this chapter, pseudo code for the implementation of the Random Forest model,
which is used to obtain better prediction accuracy, is shown.
from sklearn.ensemble import RandomForestClassifier
# Create an object
obj = RandomForestClassifier()
# Using this object, fit the training data to the random forest model
obj.fit(xTrain, yTrain)
CHAPTER 6
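# Split the data into training and testing sets (80% / 20%)
from sklearn.model_selection import train_test_split
xTrain, xTest, yTrain, yTest = train_test_split(
    xData, yData, test_size = 0.2, random_state = 42)

# Build the Random Forest classifier and fit it to the training data
from sklearn.ensemble import RandomForestClassifier
rfc = RandomForestClassifier()
rfc.fit(xTrain, yTrain)

# Predict the class of each transaction in the test set
yPred = rfc.predict(xTest)

# Evaluate the classifier
from sklearn.metrics import classification_report, accuracy_score
from sklearn.metrics import precision_score, recall_score
from sklearn.metrics import f1_score, matthews_corrcoef
from sklearn.metrics import confusion_matrix

# fraud is defined in an earlier part of the code that is not shown here
n_outliers = len(fraud)
# Number of misclassified test transactions
n_errors = (yPred != yTest).sum()
print("The model used is Random Forest classifier")

f1 = f1_score(yTest, yPred)
print("The F1-Score is {}".format(f1))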
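The other metrics imported above can be computed in the same way as the F1-score. The following is a minimal sketch on a small set of made-up labels rather than on our results; the helper name `evaluate` and the label values are assumptions for illustration only.

```python
from sklearn.metrics import (confusion_matrix, precision_score,
                             recall_score, f1_score, matthews_corrcoef)

def evaluate(y_true, y_pred):
    """Summarize binary fraud predictions (1 = fraud, 0 = genuine)."""
    # ravel() flattens the 2x2 matrix in the order TN, FP, FN, TP.
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
    return {
        "tn": int(tn), "fp": int(fp), "fn": int(fn), "tp": int(tp),
        "precision": precision_score(y_true, y_pred),
        "recall": recall_score(y_true, y_pred),
        "f1": f1_score(y_true, y_pred),
        "mcc": matthews_corrcoef(y_true, y_pred),
    }

# Made-up labels: 6 genuine and 4 fraudulent transactions.
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 0, 0, 0, 0, 1, 1, 1, 0, 0]

results = evaluate(y_true, y_pred)
for name, value in results.items():
    print("{}: {}".format(name, value))
```

In this example the recall is 0.5, meaning half of the frauds were missed, even though 7 of the 10 labels are correct. Reporting recall and the Matthews correlation coefficient alongside accuracy makes such misses visible.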
6.2 SCREENSHOTS
6.2.2 CONFUSION MATRIX
CHAPTER 7
7.1 Conclusion
Credit card fraud detection has emerged as a major solution to the credit card fraud
problem in the electronic payment sector. We developed a novel method for fraud
detection in which customers are grouped based on their transactions and
behavioral patterns are extracted to develop a profile for every cardholder. Different
classifiers are then applied to three different groups, and rating scores are generated
for every type of classifier. These dynamic changes in parameters let the system adapt
to new cardholders' transaction behaviors in a timely manner. This is followed by a
feedback mechanism to address the problem of concept drift. We observed that the
Matthews Correlation Coefficient (MCC) was the better measure for dealing with an
imbalanced dataset. MCC was not the only solution: by applying SMOTE, we tried
balancing the dataset and found that the classifiers performed better than before.
Another way of handling an imbalanced dataset is to use one-class classifiers such as
one-class SVM. Finally, we observed that logistic regression, decision tree and random
forest are the algorithms that gave better results.
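As an illustration of balancing a dataset, the sketch below uses plain random oversampling of the fraud class. This is a simpler stand-in and not SMOTE itself: SMOTE creates new synthetic minority samples, whereas this sketch only repeats existing ones. The helper name `oversample_minority` and the generated data are assumptions for illustration.

```python
import numpy as np
from sklearn.utils import resample

def oversample_minority(X, y, random_state=42):
    """Repeat fraud rows (label 1) until both classes have the same size."""
    X = np.asarray(X)
    y = np.asarray(y)
    X_major = X[y == 0]
    X_minor = X[y == 1]
    # Draw fraud rows with replacement until they match the majority count.
    X_minor_up = resample(X_minor, replace=True,
                          n_samples=len(X_major), random_state=random_state)
    X_bal = np.vstack([X_major, X_minor_up])
    y_bal = np.concatenate([np.zeros(len(X_major), dtype=int),
                            np.ones(len(X_minor_up), dtype=int)])
    return X_bal, y_bal

# Made-up data: 95 genuine and 5 fraudulent transactions with 2 features.
rng = np.random.RandomState(0)
X = rng.normal(size=(100, 2))
y = np.array([0] * 95 + [1] * 5)

X_bal, y_bal = oversample_minority(X, y)
print("Before: {} genuine, {} fraud".format(int((y == 0).sum()), int(y.sum())))
print("After:  {} genuine, {} fraud".format(int((y_bal == 0).sum()), int(y_bal.sum())))
```

Balancing of this kind is normally applied to the training data only, so that the test data keeps the real fraud ratio.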
REFERENCES
[1] Jiang, Changjun et al. “Credit Card Fraud Detection: A Novel Approach Using
Aggregation Strategy and Feedback Mechanism.” IEEE Internet of Things Journal 5
(2018): 3637-3647.
[2] Pumsirirat, A. and Yan, L. (2018). Credit Card Fraud Detection using Deep Learning
based on Auto-Encoder and Restricted Boltzmann Machine. International Journal of
Advanced Computer Science and Applications, 9(1).
[3] Mohammed, Emad, and Behrouz Far. “Supervised Machine Learning Algorithms for
Credit Card Fraudulent Transaction Detection: A Comparative Study.” IEEE Annals of
the History of Computing, IEEE, 1 July 2018,
doi.ieeecomputersociety.org/10.1109/IRI.2018.00025.
[4] S. Akila and U. Srinivasulu Reddy, “Cost-sensitive Risk Induced Bayesian Inference
Bagging (RIBIB) for credit card fraud detection,” Journal of Computational Science, vol.
27, pp. 247–254, Jul. 2018, doi: 10.1016/j.jocs.2018.06.009.
[5] A. M. Ozbayoglu, M. U. Gudelek, and O. B. Sezer, “Deep learning for financial
applications: A survey,” Applied Soft Computing, vol. 93, p. 106384, Aug. 2020, doi:
10.1016/j.asoc.2020.106384.
[6] Y. Jin, R. M. Rejesus, and B. B. Little, “Binary choice models for rare events data: a
crop insurance fraud application,” Applied Economics, vol. 37, no. 7, pp. 841–848, Apr.
2005, doi: 10.1080/0003684042000337433.
[7] Fabiana Fournier, Ivo Correia, Inna Skarbovsky, The Uncertain Case of Credit Card
Fraud Detection, The 9th ACM International Conference On Distributed Event Based
Systems (DEBS15) 2015.
[8] Yashvi Jain, Namrata Tiwari, Shripriya Dubey, Sarika Jain, A Comparative Analysis of
Various Credit Card Fraud Detection Techniques, Blue Eyes Intelligence Engineering
And Sciences Publications 2019.
[9] Dinesh L. Talekar, K. P. Adhiya, Credit Card Fraud Detection System-A Survey,
International Journal of Modern Engineering Research (IJMER) 2014.
[10] Raj S.B.E., Portia A.A., Analysis on credit card fraud detection methods, Computer,
Communication and Electrical Technology International Conference on (ICCCET)
(2011), 152-156.
[11] Jain R., Gour B., Dubey S., A hybrid approach for credit card fraud detection using
rough set and decision tree technique, International Journal of Computer Applications
139(10) (2016).
[12] Dermala N., Agrawal A.N., Credit card fraud detection using SVM and Reduction of
false alarms, International Journal of Innovations in Engineering and Technology (IJIET)
7(2) (2016).